[00:00:49] Analytics-EventLogging, Analytics-Kanban, operations: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1094294 (Nuria)
[00:01:07] mforns: ok
[00:05:14] Analytics-EventLogging, Analytics-Kanban, operations: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1094305 (mforns) a:mforns
[00:06:32] mforns: in varnishncsa the relevant bit is http://sourcecodebrowser.com/varnish/2.0.3/lib_2libvarnishapi_2shmlog_8c_source.html#l00352 I think
[00:06:48] which reads the length straight from the memory record
[00:06:57] which stores it on two bytes
[00:07:10] so a limit of 1014 seems unlikely
[00:07:18] tgr, aha
[00:16:51] tgr: man, how did you find that piece of code so fast?
[00:17:52] milimetric, ottomata, you see why Momofuku Ando hit the top views?
[00:18:12] nuria, BTW, I'm investigating the GettingStarted/GuidedTour button click issue.
[00:18:29] superm401: many thanks
[00:35:54] good night everyone, see you tomorrow!
[01:45:10] Analytics-EventLogging, MediaWiki-extensions-Sentry, Multimedia: Log EventLogging schema validation errors in Sentry - https://phabricator.wikimedia.org/T90083#1094638 (Tgr)
[04:18:57] Analytics-EventLogging: A bunch of GuidedTourButtonClicksNotValidating - https://phabricator.wikimedia.org/T91412#1094789 (Mattflaschen) a:Mattflaschen
[04:20:07] Analytics-EventLogging, MediaWiki-extensions-GuidedTour: A bunch of GuidedTourButtonClicksNotValidating - https://phabricator.wikimedia.org/T91412#1082305 (Mattflaschen)
[04:35:30] Analytics-EventLogging, MediaWiki-extensions-GuidedTour, Patch-For-Review: A bunch of GuidedTourButtonClicksNotValidating - https://phabricator.wikimedia.org/T91412#1094804 (Mattflaschen) I haven't tested the old versions, but I believe the regression was introduced [here](https://git.wikimedia.org/bl...
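tgr's point above (varnishncsa's shmlog record stores the length in two bytes) puts the cap far above the observed truncation; a quick sanity check of the arithmetic:

```shell
# A two-byte length field can represent at most 2^16 - 1 = 65535 bytes,
# well above 1014, so the truncation must come from somewhere else
# (e.g. the UDP packet size suspected in the task title).
max_len=$(( (1 << 16) - 1 ))
echo "$max_len"
```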
[05:30:56] Analytics-General-or-Unknown, Possible-Tech-Projects: Pageviews for Wikiprojects and Task Forces in Languages other than English - https://phabricator.wikimedia.org/T56184#1094839 (NiharikaKohli) @Capt_Swing ping. Do you think this task has the volume of work and complexity suitable for a 3-month GSoC/Ou...
[06:16:22] Analytics-General-or-Unknown, Possible-Tech-Projects: Pageviews for Wikiprojects and Task Forces in Languages other than English - https://phabricator.wikimedia.org/T56184#1094902 (Doc_James) By the way analysis by Andrew West per this publication has determined that what medical content people look at v...
[06:18:45] Analytics-General-or-Unknown, Possible-Tech-Projects: Pageviews for Wikiprojects and Task Forces in Languages other than English - https://phabricator.wikimedia.org/T56184#1094906 (Doc_James) Also would love to see mobile added to the popular page tool. Currently it is only desktop views. Mobile now is o...
[10:47:07] (PS1) QChris: Make custom file ending optional for thumbnails in MediaFileUrlParser [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194843
[10:47:09] (PS1) QChris: Ban dash from hex digits in MediaFileUrlParser [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194844
[10:47:11] (PS1) QChris: Add basic Java implementation of guard framework [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194845
[10:47:13] (PS1) QChris: Add basic shell glue for guard framework [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194846
[10:47:15] (PS1) QChris: Add guard for MediaFileUrlParser [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194847
[10:47:17] (PS1) QChris: Allow guard to ignore failures (based on total count) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194848
[10:47:19] (PS1) QChris: Allow guard to ignore failures (based on per-kind count) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194849
[10:53:22] (PS2) QChris: Add basic shell glue for guard framework [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194846
[10:53:24] (PS2) QChris: Add guard for MediaFileUrlParser [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194847
[10:53:26] (PS2) QChris: Allow guard to ignore failures (based on per-kind count) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194849
[10:53:28] (PS2) QChris: Allow guard to ignore failures (based on total count) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194848
[12:03:54] (PS3) QChris: Add basic shell glue for guard framework [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194846
[12:03:56] (PS3) QChris: Add guard for MediaFileUrlParser [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194847
[12:03:58] (PS2) QChris: Ban dash from hex digits in MediaFileUrlParser [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194844
[12:04:00] (PS2) QChris: Add basic Java implementation of guard framework [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194845
[12:04:02] (PS3) QChris: Allow guard to ignore failures (based on per-kind count) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194849
[12:04:04] (PS3) QChris: Allow guard to ignore failures (based on total count) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194848
[12:04:06] (PS1) QChris: Fail less hard for misrepresented urls in MediaFileUrlParser [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194855
[12:54:36] Analytics, Analytics-Cluster: Log the X-Cache header in the webrequest logs - https://phabricator.wikimedia.org/T91749#1095420 (faidon) NEW
[13:42:36] Analytics-Engineering, Analytics-Wikimetrics: Unable to add a custom cohort user - https://phabricator.wikimedia.org/T91751#1095456 (Chankalun) NEW a:Chankalun
[14:42:11] hi milimetric
[14:43:02] hey YuviPanda :) heard you've got iPythons for us :)
[14:43:09] milimetric: :D yesssssssss.
[14:43:16] milimetric: tied to Wiki logins, no less...
[14:43:22] that's pretty sweet, I'm not gonna lie
[14:43:33] milimetric: and isolated via docker containers, with dumps / replica / persistent home...
[14:43:48] milimetric: it’s not puppetized / productionized yet, but OMG IT IS SO AWESOME
[14:43:52] oh ok, so now you're just showing off :P
[14:43:55] milimetric: I sent halfak several ALL CAPS emails
[14:43:56] milimetric: :D
[14:44:01] haha
[14:44:05] no that's awesome
[14:44:15] I’ve a few more kinks to work out..
[14:44:51] yeah, but the docker idea solves all the problems I can think of
[14:44:59] and makes this a tool with great potential, good work
[14:45:32] milimetric: yup, yup :D need to put up a way to easily publish them as well. this will be a nice complement to quarry
[14:45:47] esp. since you don’t need a wikitech account or anything to be able to use them
[14:47:09] hmMMMmmmm
[14:47:10] http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/
[14:47:10] :)
[14:47:53] ottomata: niiice :)
[14:48:00] ottomata: there’s also R, Julia, etc kernels...
[14:48:12] and because it’s docker, there’s also plenty of ways to scale this out..
[14:49:15] imagine a world where data streams flow from all directions, are forked for shaping into iPython notebooks, and joined back to a central stream for public consumption
[14:50:34] :D
[14:50:45] Quarry is about to hit 2500 individual queries (and more than 10k query runs)
[14:52:05] that's awesome
[14:55:14] milimetric: :D I should give it more love in some time… raise limits to 20 mins instead of 10, and actually publicize it a bit..
[14:55:28] yeah, I think it'd be very useful
[14:56:34] (CR) Ottomata: [C: 2 V: 2] Make custom file ending optional for thumbnails in MediaFileUrlParser [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194843 (owner: QChris)
[14:57:07] (CR) Ottomata: [C: 2 V: 2] Fail less hard for misrepresented urls in MediaFileUrlParser [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194855 (owner: QChris)
[14:59:57] (CR) Ottomata: [C: 2 V: 2] Ban dash from hex digits in MediaFileUrlParser [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194844 (owner: QChris)
[15:00:55] Seems I am coming back just at the right time :-)
[15:02:14] standup interruption!
[15:02:14] :)
[15:02:22] Ah. I see.
[15:02:27] :-D
[15:02:45] Anyways ... maybe I can talk you into doing a release today with the above changes?
[15:03:05] Because (if possible) I think I should rerun the mediacounts with those fixed.
[15:03:09] s/fixed/fixes/
[15:11:45] milimetric: halfak am off for dinner. current users will continue working, but new users will get a permission denied. I’ll fix it when I come back.
[15:12:56] np, bon appétit
[15:15:53] kk o/
[15:18:59] mforns: http://bl.ocks.org/yuuniverse4444/8325617
[15:19:18] milimetric, oh, nice
[15:19:28] it could use some color work and the hover's a bit wonky
[15:19:31] but it's a great start
[15:19:35] basically exactly what we need
[15:19:46] aha
[15:19:58] yes, the other day I was looking at this: http://code.shutterstock.com/rickshaw/examples/status.html
[15:20:25] mforns: oh yeah, but that's timeseries no?
[15:20:44] yes, it seems it would need more adaptation
[15:20:59] yes and your example also has the hover
[15:21:19] ok, I'll grab that for today
[15:21:59] mforns: no, I mean, rickshaw doesn't let you do anything else except for timeseries
[15:22:09] oh I see!
[15:22:26] yeah, it's the main reason I didn't want to use it
[15:22:37] but it's handy for the simple timeseries stuff
[15:22:49] milimetric, aha
[15:22:58] ok, thanks for the idea
[15:23:21] if I get stuck, I'll ping you :]
[15:24:39] (CR) Ottomata: [C: 2 V: 2] Add basic Java implementation of guard framework [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194845 (owner: QChris)
[15:41:32] nuria, those jobs timed out waiting for some data to exist, i am rerunning them now
[15:41:56] i wonder if the timeout is too low for daily jobs, since they often get instantiated before the full day exists
[15:45:10] (PS1) Ottomata: Set timeout to -1 for mobile apps jobs. These operate on long periods of data (daily, monthly) [analytics/refinery] - https://gerrit.wikimedia.org/r/194868
[15:45:55] ottomata: but i think that is when you switched the cluster right?
[15:46:04] ottomata: so there was "backlog"
[15:46:16] ottomata: they ran for almost a month w/o issues
[15:46:22] right?
[15:47:17] nuria: They also timed out before. At least some of them.
[15:47:30] qchris: ah yeah?
[15:47:41] (CR) Ottomata: "Hey yalls," [analytics/refinery] - https://gerrit.wikimedia.org/r/194868 (owner: Ottomata)
[15:47:53] yup.
[15:48:00] nuria, i don't think they had
[15:48:06] qchris: did not know that, so then maybe we need to amp up the timeouts
[15:48:19] i think i reran them before too, from when I had to wrangle a bunch of jobs after the cluster upgrade
[15:48:46] the 25th is after I upgraded the cluster, i thought maybe our cluster resource contention issue might have contributed, but i don't think so
[15:48:54] timeout was set at 5 hours after instantiation
[15:49:01] i think for daily and monthly this won't work
[15:51:46] nuria: 0001459-150216211537130-oozie-oozi-C@20 is one of the jobs that timed out. That was from 2015-02-18.
[15:51:48] ottomata: but probably because we need to parameterize jobs differently, and count backwards rather than forward when it comes to daily partitions
[15:52:20] ottomata: so on jan 5th we execute over data from jan 4th
[15:52:35] ottomata: that would make more sense with a 5 hour timeout
[15:52:53] hmm
[15:53:00] joseph was arguing for that too
[15:53:12] i didn't buy it, because i like the fact that nominal time matches the data for which you are running the job
[15:53:20] * qchris agrees with ottomata.
[15:53:48] ottomata, qchris: ya, i get that is more intuitive
[15:53:58] why not just a high or no timeout
[15:54:00] ottomata, qchris: soooo... what can we do?
[15:54:03] i don't really see why we need a timeout
[15:54:06] we can do this:
[15:54:12] https://gerrit.wikimedia.org/r/194868
[15:54:13] ottomata: unbounded executions lead to problems
[15:54:19] it isn't execution
[15:54:21] ottomata: in my humble opinion
[15:54:33] it is a timeout on waiting for data to exist
[15:54:40] before oozie decides it isn't going to happen
[15:55:04] timeout: The maximum time, in minutes, that a materialized action will be waiting for the additional conditions to be satisfied before being discarded.
[15:56:49] ottomata: as long as "checking" whether that data exists does not take up much resources on the cluster
[15:57:07] ottomata: cause oozie will be checking for a longer period
[15:57:37] i mean, it just looks for existence of a SUCCESS file
[15:57:50] for each of its datasets
[15:59:15] ottomata: ok, if you and qqchris agree that is a good compromise let's do it
[15:59:19] sorry qchris
[16:02:53] so qchris, convince me that it is good to have a top level directory called guard that contains shell scripts :)
[16:03:14] :-)
[16:03:25] Is there a better place for the shell script wrappers?
[16:03:49] They certainly do not belong in one of the maven directories.
[16:03:59] (So refinery-{tools, hive, ...} is out)
[16:04:25] not in resources/
[16:04:25] ?
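The `-1` timeout in ottomata's patch (194868) lives in the coordinator's `controls` block. A minimal sketch, with the app name and attribute values assumed rather than taken from the actual refinery coordinator:

```xml
<coordinator-app name="mobile_apps_uniques_daily"
                 frequency="${coord:days(1)}"
                 start="${start_time}" end="${stop_time}"
                 timezone="Universal"
                 xmlns="uri:oozie:coordinator:0.4">
  <controls>
    <!-- minutes a materialized action waits for its input datasets
         (the SUCCESS flags mentioned above) before being discarded;
         -1 waits indefinitely -->
    <timeout>-1</timeout>
  </controls>
  <!-- datasets, input-events, and the action itself elided -->
</coordinator-app>
```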
[16:05:09] You mean ... resources in the top level, or refinery-tools/resources?
[16:05:12] src/resources/
[16:05:15] yeah
[16:05:30] the main stuff maybe in refinery-core/src/resources
[16:05:38] and the specific guards with their own projects
[16:05:39] Ah. No. Those scripts are just external tooling. The refinery-tools jar can live without them.
[16:05:40] like tests are
[16:05:49] resources get packed in jars?
[16:05:54] test resources go there, no?
[16:05:56] Not necessarily.
[16:06:16] But regardless ... we do not want the scripts in the jar, or do we?
[16:06:19] hm, maybe they should be part of refinery instead of refinery-source?
[16:06:25] hmm
[16:06:26] hmm
[16:06:29] hm. not sure
[16:06:30] maybe not.
[16:06:38] I would keep them in refinery-source.
[16:06:39] it is specific to source
[16:06:40] yeah
[16:06:43] the guard classes are there
[16:06:44] hm
[16:06:46] Right.
[16:06:51] That's the argument.
[16:07:06] wait, why not resources? because of jar? they don't get packed in the jar, do they? do the test resources get packed in the jar?
[16:07:30] refinery-core/src/test/resources/
[16:07:31] One can tune what gets packaged into the jar.
[16:07:33] GeoIP2-City-Test.mmdb GeoIP2-Country-Test.mmdb access_method_test_data.csv isCrawler_test_data.csv pageview_test_data.csv x_analytics_test_data.csv
[16:07:36] aye
[16:07:56] But refinery-tools/.../resources feels wrong, because that
[16:08:14] directory would hold resources that tie to the refinery-tools jar.
[16:08:25] But the shell wrappers are decoupled from the jar.
[16:09:04] hm, put it in refinery-tools/src/main/bash?
[16:09:06] Also ... no one would find them if we hid them underneath refinery-tools/...
[16:09:17] true
[16:09:26] hmmMMMm
[16:09:43] I'd only keep Java stuff (and really mandatory resources) in refinery-tools/...
[16:09:53] haha
[16:10:03] that is what you said about refinery/source in general
[16:10:06] ottomata, nuria: I ran january unique_monthly, and there are almost twice the number of iOS
[16:10:08] and that is why we have two repositories
[16:10:13] in comparison to feb
[16:10:50] True. But the separation between refinery and refinery/source is still valid.
[16:10:53] I checked duration as well: 1:03
[16:11:04] For the daily query, it was 3 minutes
[16:11:04] Even if there is some shell scripting in refinery/source.
[16:11:17] joal: did you look at the e-mail from mobile folks? I was about to do that now so i can answer before they get to the office
[16:11:19] I think we could deploy that as well today if you wish :)
[16:11:25] For me, running the guard ties way more to the sources than how to create Hive tables.
[16:11:28] I have seen it yes
[16:11:49] joal: same query with vastly different results in both months suggests data loss (if iOS team hasn't changed anything)
[16:11:57] ottomata: and the guards can run completely outside of a refinery setup
[16:12:07] I am going to double check numbers with the new definition for sections
[16:12:53] aye, qchris, but we don't deploy refinery/source
[16:13:23] qchris: do you intend for this to be automated, or to run it manually occasionally?
[16:13:46] ottomata: True. But a simple checkout (outside of the cluster. can be plain fs on any machine) will do. './run_all_guards.sh --rebuild-jar' in a cron will do the rest
[16:13:56] ottomata: Automatically. in a cron.
[16:14:13] failed output emailed i suppose?
[16:14:29] joal: ok, looking at adam's e-mail now
[16:14:30] ottomata: Last time you said that it's ok if you get those emails. Yes.
[16:14:37] yes, i kinda remember :)
[16:15:00] ottomata: But I am not sold on it. If you have better suggestions ... let's hear them.
[16:15:28] ottomata: The shell scripting decouples this on purpose, so one can switch from cron+email to whatever one likes.
[16:15:37] aye
[16:15:41] (CR) Joal: [C: 1] "I think it's a good idea. I would also monitor automatically the date of most ancient waiting job -> it can delay everything for a given j" [analytics/refinery] - https://gerrit.wikimedia.org/r/194868 (owner: Ottomata)
[16:21:14] qchris: do these scripts depend on the cwd from which you are launching them?
[16:21:15] joal: ok, our queries should pick up data just fine so (query-wise) i cannot find a reason why data should differ greatly between jan and feb. Were android results very different also for january?
[16:21:32] ottomata: no.
[16:21:59] ottomata: (At least I tried hard to make them not rely on cwd. If they fail for a certain cwd ... that's a bug)
[16:22:33] nuria: Last email from Dan suggests having sections=all
[16:22:42] I am trying it now :)
[16:23:12] joal: no need
[16:23:18] ah?
[16:23:21] joal: we only use
[16:23:47] joal: ah no, wait
[16:24:07] :D
[16:24:17] trying it right now
[16:24:23] joal: me -> read too fast
[16:24:29] np
[16:24:45] joal: i think we should remove the "sections" entirely, we are -after all- counting "distinct"
[16:24:50] joal: right?
[16:25:32] qchris:
[16:25:33] reset_guard_arguments() {
[16:25:33] GUARD_ARGUMENTS=()
[16:25:38] can't you just do
[16:25:42] unset GUARD_ARGUMENTS
[16:25:42] ?
[16:25:54] nuria: I am gonna double check numbers with and without
[16:26:48] qchris: GUARD="$(basename "$(pwd)")"
[16:26:48] ?
[16:26:48] ottomata: Yes, one could. But we want GUARD_ARGUMENTS to be an array. So if we unset it,
[16:26:53] oh ok
[16:26:55] got it
[16:27:00] ottomata: we'd have to check whether or not GUARD_ARGUMENTS got initialized (upon adding arguments).
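qchris's answer can be seen in a few lines of bash: reinitialising with `=()` (instead of `unset`) keeps `GUARD_ARGUMENTS` an array, so later `+=` appends need no existence check. A standalone sketch; the flag values are made up for illustration:

```shell
#!/bin/bash
# Reassigning with =() empties the array but keeps it an array,
# so callers can append unconditionally afterwards.
reset_guard_arguments() {
    GUARD_ARGUMENTS=()
}

reset_guard_arguments
GUARD_ARGUMENTS+=("--ignore-failures")   # hypothetical flag
GUARD_ARGUMENTS+=("5")
echo "${#GUARD_ARGUMENTS[@]}"            # element count after two appends
```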
[16:27:28] joal: the thing is that parameter is to distinguish user initiated requests vs not
[16:27:32] ottomata: The "basename "$(pwd)"" is just "convention over configuration"
[16:27:45] joal: but since we are counting distinct appinstallids it doesn't matter
[16:27:46] ottomata: So naming the directory will choose the right Guard class.
[16:28:06] ottomata: So e.g.: in https://gerrit.wikimedia.org/r/#/c/194847/
[16:28:23] ottomata: the directory is called MediaFileUrlParser, hence it will
[16:28:24] nuria: ok
[16:28:34] ottomata: pick the MediaFileUrlParserGuard.
[16:28:35] but pwd means you'd have to be cd'ed into that dir?
[16:28:40] I am still going to double check, it doesn't cost much
[16:28:53] tools/common.inc takes care of that.
[16:29:22] Sorry. That was wrong.
[16:30:11] ?
[16:30:12] No it was right :-)
[16:30:26] tools/common.inc takes care of "cd"-ing to the script's directory.
[16:30:31] cd "$(dirname "$0")"
[16:30:39] (CR) Ottomata: Add guard for MediaFileUrlParser (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194847 (owner: QChris)
[16:31:04] ???
[16:31:14] oh because common is included from the script in the subdir
[16:31:14] uhh
[16:31:15] $0 is the name of the script
[16:31:24] right
[16:31:40] no, it's not though, it isn't included, it is run
[16:31:45] https://gerrit.wikimedia.org/r/#/c/194847/3/guard/MediaFileUrlParser/run_guard.sh
[16:31:52] ../tools/run_guard.sh
[16:31:55] that runs the top level run_guard
[16:32:02] oh, that reminds me
[16:32:06] * Ironholds whaps qchris
[16:32:10] Let's dissect that command :-)
[16:32:11] won't $0 be tools/run_guard.sh
[16:32:12] haha
[16:32:13] oh man
[16:32:16] commit messages should explain the desired outcome, not just the script name!
[16:32:17] :P
[16:32:35] I woke up to a dozen gerrit emails that were like "implement a guard" "implement a guard...in Java"
[16:33:05] Ironholds: They do!
[16:33:18] ottomata: $0 is the script name
[16:33:22] ja, qchris, i might ask something similar. could you add a big ol README in guard/ about 1. how to use, and 2. how to implement new guards?
[16:33:27] i think i'm having trouble following
[16:33:35] since it is all about relative imports and directory names
[16:33:36] joal: just sent e-mail to dan to understand this better
[16:33:40] ottomata: like "guard/MediaFileUrlParser/run_guard.sh"
[16:33:46] Thx Nuria
[16:33:50] testing qchris...
[16:34:06] ottomata: dirname "$0" is then "guard/MediaFileUrlParser"
[16:34:36] k. I'll add a README.
[16:34:45] ja qchris
[16:34:59] [:/tmp] 1 $ cat f1.sh
[16:35:00] ./f2.sh
[16:35:00] [:/tmp] $ cat f2.sh
[16:35:00] echo "\$0 is $0"
[16:35:03] ah, will make a gist
[16:35:33] ottomata: so 'cd "$(dirname "$0")"' cds to the directory of the script
[16:35:56] https://gist.github.com/ottomata/26b69544497d8432c394
[16:36:02] * qchris looks
[16:36:10] oh
[16:36:11] sorry
[16:36:14] that shows that you are right
[16:36:15] weird
[16:36:22] $0 doesn't get reset when the script runs another script?
[16:36:31] oh
[16:36:32] yes it does
[16:36:33] sorry
[16:36:34] ha
[16:36:34] yes
[16:36:40] inside of f2.sh
[16:36:43] $0 is always f2.sh
[16:37:05] Now I am getting lost in what you wanted to say. Sorry.
[16:37:09] ok
[16:37:13] guard/MediaFileUrlParser/run_guard.sh
[16:37:14] does
[16:37:18] ../tools/run_guard.sh
[16:37:35] "does" means "is a link to"
[16:37:39] ?
[16:37:41] that means
[16:37:46] runs, or executes
[16:37:52] OH
[16:37:58] OH
[16:38:00] it is a symlink.
[16:38:00] doh
[16:38:01] ok.
[16:38:09] i just saw the content in the gerrit change
[16:38:13] which shows the path to the file
[16:38:18] Ah. True.
[16:38:20] which is the same as executing it in a shell script
[16:38:21] haha
[16:38:21] ok ok
[16:38:22] got it
[16:38:26] https://gerrit.wikimedia.org/r/#/c/194847/3/guard/MediaFileUrlParser/run_guard.sh
[16:38:52] Hahaha. True. That diff /is/ misleading.
[16:38:54] joal: will be here, let me know what you find
[16:38:55] it does look like you committed a file with the contents ../tools/run_guard.sh
[16:38:55] haha
[16:39:13] The "Type: Symbolic Link" on the far right is ... well it's invisible.
[16:39:20] I had to look for it too to find it.
[16:40:04] nuria: https://phabricator.wikimedia.org/P366
[16:40:10] Sounds like we can remove :)
[16:40:14] And recompute
[16:40:51] joal: that actually makes a lot of sense right? specially when counting distinct occurrences
[16:40:54] hmmm
[16:40:59] qchris, the link does the same thing though, no?
[16:41:12] ls -l sub/
[16:41:17] link.sh -> ../f1.sh
[16:41:33] $ sub/link.sh
[16:41:33] $0 is ./f2.sh
[16:42:02] joal: let's wait to see what dan answers to the e-mail just in case and we can re-run jan/feb daily/monthly, right?
[16:42:36] nuria: Yes sure :)
[16:42:42] Analytics, MediaWiki-Core-Team, Wikimedia-Site-requests: Ran out of captcha images - https://phabricator.wikimedia.org/T91760#1095716 (Nemo_bis) Note, captchas were made a lot harder by df4806c64c48c2cd2cee063611b3193a47c069c8; side effects of new generation FancyCaptcha images have yet to be assessed.
[16:42:47] joal: many thanks for looking into this
[16:42:59] nuria: But we could have had mobile sessions without sections IN (0, all) no?
[16:43:01] ottomata: Sorry ... I guess I still don't get the issue.
[16:43:05] ha, ok
[16:43:22] nuria: no problemo, that's part of the job, isn't it?
[16:43:25] MediaFileUrlParser/run_guard.sh -> ../tools/run_guard.sh
[16:43:31] right.
[16:43:46] joal: without this line entirely you mean: AND uri_query LIKE('%sections=0%')
[16:43:47] you still manage the communication with Dan, so it's easier for me :)
[16:44:00] then
[16:44:02] source "$(dirname "$0")/../tools/common.inc"
[16:44:08] from tools/run_guard.sh
[16:44:25] right.
[16:44:42] nuria: one run as it is now, one with sections in (0, all), one with no check on the section parameter
[16:44:50] then there are uses of $(pwd)
[16:44:53] and
[16:45:00] $0
[16:45:04] joal: ok
[16:45:53] Removing the check means we no longer miss mobile sessions that wouldn't access sections 0 or all (if such sessions even exist)
[16:46:03] in my test at least, $0 will be tools/run_guard.sh
[16:46:06] nuria: --^
[16:46:07] ottomata: $0 in all instances should be "the command that was used to invoke run_guard.sh"
[16:46:13] ottomata: right.
[16:46:54] joal: right, which might be "not user initiated requests"
[16:47:30] joal: which we should not do on a mobile connection as it eats bandwidth, so ..
[16:47:43] joal: that is why i was waiting for dan to answer
[16:47:47] nuria: if you say so :)
[16:47:48] sorry what?
[16:47:50] nuria: ok
[16:47:52] no prob
[16:47:56] naw, $0 is the currently executing script
[16:48:01] file
[16:48:08] I'll prepare a new patch for the daily refactor
[16:48:32] hmmm
[16:48:33] including sections in ('0', 'all')
[16:48:36] nuria: what were you waiting on my answer for?
[16:48:43] maybe i am wrong (i usually am when arguing with qchris)
[16:48:58] milimetric: "mobile-dan" not "analytics-dan"
[16:49:03] sry, k
[16:49:25] ottomata, hive question?
[16:49:38] qchris i think in my test my symlink still pointed at a file that executed another file
[16:49:41] sorry. keyboard died.
[16:49:45] why the heck do column names always come out as table.col_name now? I swear that didn't used to happen
[16:49:53] ungh, i dunno whatever qchris, i trust that you are right on this one :)
[16:49:53] anyway, yeah, write a readme so I can follow better
[16:50:04] hive 0.13?
[16:50:05] maybe Ironholds?
[16:50:06] dunno
[16:50:14] comments on columns now also work
[16:50:20] they did not before
[16:50:34] huh
[16:50:36] thanks!
[16:50:38] ottomata: yup. A readme you'll get.
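On the `table.col_name` question: Hive 0.13 did start prefixing result-set columns with the table name, and a session property controls it. The property name below is from memory and worth verifying against the Hive configuration docs before relying on it:

```sql
-- Restores bare column names in query results
-- (the default behaviour changed around Hive 0.13).
SET hive.resultset.use.unique.column.names=false;
```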
[16:51:03] (CR) Ottomata: "Hopefully the table can be populated manually by futzing with the data files :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/194400 (owner: Joal)
[16:51:05] In the meantime ... adding a guard is as simple as the 61 lines of https://gerrit.wikimedia.org/r/#/c/194847/.
[16:51:24] (CR) Ottomata: "Joal, I'm going to step back from reviewing this one, and let you and nuria settle it. I'm sure it will be good :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/194400 (owner: Joal)
[16:51:52] (CR) Joal: "Agreed with ottomata, except that since definition changed, a re-run might be needed" [analytics/refinery] - https://gerrit.wikimedia.org/r/194400 (owner: Joal)
[16:58:40] ok ottomata ;)
[17:11:15] Analytics, Patch-For-Review: Configure CORS on datasets.wikimedia.org - https://phabricator.wikimedia.org/T91532#1095865 (Milimetric) Open>Resolved a:Milimetric Dario - if you put up some TSVs, I can see if anything else is wrong with my little adhoc thingy :)
[17:16:35] Analytics-Kanban, Patch-For-Review: Analyze different types of users in the context of Edit Schema events {lion} - https://phabricator.wikimedia.org/T89729#1095897 (Milimetric) Open>Resolved
[17:33:19] cd ..
[17:33:34] Who stole my keyboard focus?
[17:37:10] (PS4) QChris: Add guard for MediaFileUrlParser [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194847
[17:37:12] (PS4) QChris: Allow guard to ignore failures (based on per-kind count) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194849
[17:37:14] (PS4) QChris: Allow guard to ignore failures (based on total count) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194848
[17:37:16] (PS1) QChris: Add a README for the guard framework [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194901
[17:37:54] (CR) QChris: Add guard for MediaFileUrlParser (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194847 (owner: QChris)
[17:39:20] ottomata: regardless of the guard framework ... any chance you could do a refinery/source release and jar update, so I could rerun the mediacounts jobs over the weekend?
[17:46:30] sure!
[17:46:35] Awesome!
[17:46:40] if I do release, you can do deploy, ja? :)
[17:46:47] Sure.
[17:47:15] But I cannot upload to archiva ... so that would need your magic hands too.
[17:48:55] ja i will do that
[17:49:00] qchris, this is a good readme, thank you
[17:49:13] cool!
[17:49:15] you win a guard merge! :)
[17:49:29] (CR) Ottomata: [C: 2] Add guard for MediaFileUrlParser [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194847 (owner: QChris)
[17:49:36] (CR) Ottomata: [V: 2] Add guard for MediaFileUrlParser [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194847 (owner: QChris)
[17:49:48] Whoa! I win! Yippie!
[17:49:54] (CR) Ottomata: [C: 2 V: 2] Allow guard to ignore failures (based on total count) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194848 (owner: QChris)
[17:50:10] (CR) Ottomata: [C: 2 V: 2] Allow guard to ignore failures (based on per-kind count) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194849 (owner: QChris)
[17:50:23] (CR) Ottomata: [C: 2 V: 2] Add a README for the guard framework [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194901 (owner: QChris)
[17:50:55] (CR) Ottomata: [C: 2 V: 2] Add basic shell glue for guard framework [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194846 (owner: QChris)
[17:51:53] (PS1) Ottomata: Update changelog in preparation for 0.0.8 release [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194904
[17:52:04] (CR) Ottomata: [C: 2 V: 2] Update changelog in preparation for 0.0.8 release [analytics/refinery/source] - https://gerrit.wikimedia.org/r/194904 (owner: Ottomata)
[18:04:12] joal: let's remove the section thingy from the mobile query and repeat monthly counts, sounds good?
[18:06:23] nuria: while at it, I suggest doing the same to daily, and repeating daily as well!
[18:06:31] joal: sounds great
[18:06:35] :)
[18:06:38] Let's go :)
[18:06:46] joal: thanks a lot for doing these changes
[18:06:54] No problem :)
[18:11:21] (PS2) Joal: Refactor mobile_apps_uniques_daily to match newly baked monthly. [analytics/refinery] - https://gerrit.wikimedia.org/r/194400
[18:11:47] nuria: code review ;)
[18:11:58] joal: looking
[18:12:22] let's discuss the change I made on monthly
[18:12:32] Want to be sure you think it's ok ;)
[18:13:28] (CR) Nuria: [C: 1] Refactor mobile_apps_uniques_daily to match newly baked monthly. [analytics/refinery] - https://gerrit.wikimedia.org/r/194400 (owner: Joal)
[18:13:56] joal: +1, i leave merges up to ottomata so he bundles those with deployment
[18:14:43] ok nuria, thx :)
[18:20:49] ottomata: let me know when merged, I'll deploy and start the needed jobs
[18:22:06] aye, cool, qchris is going to do a general deploy, ja qchris?
[18:22:11] Analytics-Wikimetrics: Labs instances rely on unpuppetized firewall setup to connect to databases - https://phabricator.wikimedia.org/T71042#1096128 (coren) Open>Resolved a:coren The iptables have been obsoleted some time ago as the replica databases were merged.
[18:22:12] with new refinery source version?
[18:22:18] (that is uploading now)
[18:22:26] so, we can merge this anytime
[18:22:31] ottomata: I'll deploy the version to the cluster. Sure. But I'll
[18:22:42] qchris: once the jars are in archiva, you can handle it?
[18:22:56] not gonna rerun the uniques jobs. I quit because of those. /me no touchy.
[18:22:58] or do you need me to do the refinery artifacts?
[18:23:00] aye no
[18:23:01] haha
[18:23:02] not those
[18:23:05] just the deploy
[18:23:06] part
[18:23:06] :)
[18:23:09] Yes. Sure.
[18:23:15] (CR) Ottomata: [C: 2] Refactor mobile_apps_uniques_daily to match newly baked monthly. [analytics/refinery] - https://gerrit.wikimedia.org/r/194400 (owner: Joal)
[18:23:17] I'll deploy the refinery version.
[18:23:22] (CR) Ottomata: [V: 2] Refactor mobile_apps_uniques_daily to match newly baked monthly. [analytics/refinery] - https://gerrit.wikimedia.org/r/194400 (owner: Joal)
[18:23:27] nuria: joal, merged. but, you can merge too!
[18:23:29] in the futurreee [18:23:43] if you two have commits you have reviewed, and have +1s or +2s, either of you can merge [18:23:54] just refinery deploys are pretty easy, and joal can do those now [18:24:06] refinery-source deploys and updates are a little trickier [18:24:32] ottomata: ok, shall do that going forward [18:24:44] ottomata: I forgot... how do we re-run jobs in the cluster? [18:25:02] ottomata: is removing the output file sufficient? [18:27:30] nuria: depends on how you need to rerun [18:27:38] if you just need to rerun the oozie job, then [18:27:47] oozie job -rerun -action; if you need to reload the job and re-run [18:27:56] ottomata: ahhh ok [18:27:59] then you probably need to kill the oozie job [18:28:00] and resubmit [18:28:27] joal knows how to do this :) [18:28:34] joal is a knowledgeable kinda guy :) [18:28:57] ottomata: jaja but if we all know that would be best i think, note taken [18:32:31] yup :) [18:32:42] cool, qchris, refinery/source release done. [18:32:47] 0.0.8 should be in archiva now [18:32:55] Cool. Thanks. [18:33:08] Ironholds: note that qchris is about to deploy your pageview changes. this will not change the pageview def currently running [18:33:13] unless someone restarts the oozie job [18:33:16] Argh. Dinner. I'll deploy after dinner then. [18:33:17] we can do that whenever we are ready [18:33:19] k! [18:33:21] no worries! [18:34:24] ottomata, coolio [18:38:51] ok, gotta run yall, we are taking aaron and kevin for a philly sight seeing tour and then to the airport [18:38:54] lateerrrsss [18:41:07] fun! [18:41:57] Analytics: find out what browsers Wikimedia projects editors use - https://phabricator.wikimedia.org/T78539#1096287 (Amire80) Open>Resolved a:Amire80 This is pretty much resolved by http://datavis.wmflabs.org/agents/ . I do hope that it will be updated regularly ;) Thanks, @Ironholds!
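ottomata's two re-run paths ([18:27:38]–[18:28:00] above) can be sketched as shell commands. This is a hedged sketch only: the coordinator ID, action numbers, and properties file are placeholders, and the exact flags should be checked against the cluster's oozie CLI (`oozie help job`) before use.

```shell
# Path 1: re-run specific actions of an existing oozie coordinator
# (IDs and action numbers below are placeholders)
oozie job -rerun <coordinator-id> -action <action-numbers>

# Path 2: if the job definition itself changed, kill and resubmit
oozie job -kill <coordinator-id>
oozie job -submit -config <coordinator.properties>
```

Simply removing the output file (nuria's question) is not sufficient on its own; the coordinator action has to be re-run or resubmitted so oozie re-schedules the work.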
[18:42:07] Analytics: find out what browsers Wikimedia projects editors use - https://phabricator.wikimedia.org/T78539#1096291 (Amire80) a:Amire80>Ironholds [18:46:03] Analytics: find out what browsers Wikimedia projects editors use - https://phabricator.wikimedia.org/T78539#1096307 (Ironholds) As explained, that will not be updated regularly. If there's a pressing need for regular updates Analytics Engineering should build a system. ....Actually I could probably build a... [18:46:08] nuria, so I have an idea. [18:46:18] wmf.webrequests contains is_pageview now, yuss? [18:46:38] and it contains the user agent and the source [18:47:48] so if we had an oozie job that just ran something like SELECT * FROM (SELECT os,os_major,browser,browser_major, COUNT(*) AS pageview_count FROM wmf.webrequest_source WHERE is_pageview = 'true' GROUP BY os, os_major, browser, browser_major) HAVING pageview_count > 500 ORDER BY pageview_count DESC; [18:47:59] (oh, and webrequest_source) [18:48:07] and then threw it at the public folders... [18:48:23] we could really trivially plug it into the agents exploratory tool [18:48:26] and this wouldn't be much work at all [18:48:53] *500000 [18:53:46] Ironholds: reading [18:54:30] Ironholds: seems that it would be better to process the UA when the geo coe info is processed [18:54:35] *geo code [18:54:48] Ironholds: that is, at the time of creating refined tables, every hour [18:54:53] * joal agree with nuria [18:55:01] Ironholds: so teh refined table never has a "raw" user agent [18:55:06] *the [18:55:31] Ironholds: that is easily done (and ahem, that is what we created refined tables to start with) [18:55:53] Ironholds: and after, over those records we can define -easily- an aggregation strategy as we see fit [18:56:01] Ironholds: this is a pretty short task [18:57:15] nuria, agreed! [18:57:22] we should make otto do that :D [18:57:27] oh, hmn. [18:57:33] Ironholds: I'll do it :) [18:57:37] joal, okie!
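For reference, Ironholds' one-liner above needs a couple of tweaks to be valid HiveQL: a derived table must be aliased, and since the filter is on the grouped count, the HAVING clause can sit directly next to the GROUP BY with no subquery at all. A hedged sketch, assuming the parsed os/browser fields proposed in T91793 exist on wmf.webrequest and that webrequest_source is a partition column:

```sql
-- Sketch only: column and partition names are assumptions taken from
-- the chat, not verified against the actual wmf.webrequest schema.
SELECT os, os_major, browser, browser_major,
       COUNT(*) AS pageview_count
FROM wmf.webrequest
WHERE webrequest_source = 'text'   -- hypothetical partition value
  AND is_pageview                  -- boolean column per the discussion
GROUP BY os, os_major, browser, browser_major
HAVING COUNT(*) > 500000           -- Ironholds' corrected threshold
ORDER BY pageview_count DESC;
```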
:D [18:57:47] I'm not sure how that'll play with our additional ua-parsers though. [18:57:49] Analytics-Cluster, Analytics-Kanban: Add processed user agent to refined tables - https://phabricator.wikimedia.org/T91793#1096418 (Nuria) NEW [18:57:55] Ironholds, nuria: Me like automating jobs ;) [18:58:10] Ironholds, joal: just filed task: https://phabricator.wikimedia.org/T91793 [18:58:14] I guess if we want to factor those in we can just go WHERE parsed_agent['device'] == 'Spider' OR other_udf(agent) = true [18:58:18] nuria, awesome! Thanks! :D [18:58:41] I'll noodle on how to least-obnoxiously make my application ask for it. [18:58:50] Ironholds: "i do not get this: 'I'm not sure how that'll play with our additional ua-parsers though'" [18:59:19] Ironholds: Can you please just add the udf name as a comment in the task ? [19:00:20] joal, the non-ua-parser UA parser? [19:00:29] oh, we'll just not factor that in; it outputs a boolean anyhoo [19:00:53] nuria, so, there are Wikimedia-specific crawlers. We have a UDF just for detecting those. But it outputs a boolean anyway so we can just include that as and when we need it [19:01:08] hmmm [19:01:30] I'll have a look at the udfs when merged and will get back :) [19:01:40] Ironholds: but that goes on the "spider" column does it not? [19:02:05] Ironholds: Let's try to separate concerns. [19:02:15] yeah, agreed [19:02:17] I just said we should :D [19:25:36] nuria: Shall I go for merging andrew's request on timeout for daily and monthly jobs ? [19:28:24] joal: sounds good [19:28:45] cool, will rebase, submit, merge and run the new jobs :) [19:29:49] (PS2) Joal: Set timeout to -1 for mobile apps jobs. These operate on long periods of data (daily, monthly) [analytics/refinery] - https://gerrit.wikimedia.org/r/194868 (owner: Ottomata) [19:30:53] nuria: could you review the last patch ? [19:33:43] joal: same as patch #1 plus rebase, right? [19:33:51] Correct [19:33:58] nuria: --^ [19:34:25] milimetric, yt?
[19:34:32] (CR) Nuria: [C: 2] "Sounds fine. +2, let's keep an eye to make sure queue of "jobs waiting" does not get huge." [analytics/refinery] - https://gerrit.wikimedia.org/r/194868 (owner: Ottomata) [19:36:30] nuria: I can't find a way to verify the request ... Do I need a specific right? [19:36:55] joal: wait.. what request? [19:37:08] nuria: same again [19:37:40] joal: ay ay .. me no compredou [19:37:44] Tells:) [19:37:51] So, I want to merge this request [19:38:11] nuria: In order to do that, gerrit tells me the request needs to be verified [19:38:16] (CR) Nuria: [V: 2] Set timeout to -1 for mobile apps jobs. These operate on long periods of data (daily, monthly) [analytics/refinery] - https://gerrit.wikimedia.org/r/194868 (owner: Ottomata) [19:38:22] joal: ahhh [19:38:41] just did: +2 verified, +2 CR and "publish and submit" [19:38:48] joal: should be merging by now [19:38:52] k [19:39:00] It seems I don't have the right to do that [19:39:01] joal: ok, we are good [19:39:06] Perfect :) [19:39:15] nuria: Thx :) [19:39:31] I'll ask andrew why I don't have the right to give a +2 ;) [19:39:42] joal: right, that is some gerrit config [19:40:08] nuria: ok, I go and deploy, then restart job [19:41:33] joal: k excellent [19:44:06] (PS1) QChris: Add 0.0.8 refinery jars and update symlinks [analytics/refinery] - https://gerrit.wikimedia.org/r/194931 [19:44:08] (PS1) QChris: Bump refinery version to 0.0.8 for mediacounts [analytics/refinery] - https://gerrit.wikimedia.org/r/194932 [19:44:36] qchris: Shall I change my param for jar version in jobs ? [19:45:15] qchris: Or maybe ask you ? [19:45:17] I did not drop the 0.0.7 jars, so it's fine to not increase them. [19:45:36] ok [19:45:39] But ... [19:45:47] Since you are at it ;) [19:45:47] Would you want to merge the above two changes? :-D [19:46:19] I'd love to, but unfortunately it seems I don't have the right to give +2s :( [19:46:25] Ahhhhh, sadness :) [19:46:35] You don't? [19:46:39] Nope [19:46:42] Let me check ...
[19:46:54] Sure [19:47:27] Do you have +2 now? [19:48:28] qchris: in the bump commit, would you also bump oozie/webrequest/refine/bundle.properties refinery_jar_version? [19:48:31] Let me check [19:48:40] Typically, we don't. [19:48:45] huhu [19:48:51] Bringing in new jars is one step. [19:48:52] I wanna learn :) [19:48:58] Upgrading jobs is another one. [19:49:13] That way, we're not forced to upgrade all at once. [19:49:20] Hence upgrades cause fewer side effects. [19:49:32] ok, I understand [19:49:37] E.g.: The glam tsvs are still on 0.0.5. [19:49:38] Let's not change [19:50:54] joal: You're not in the ldap group 'wmf' . Being in that group grants access to many things. Like icinga, graphite, ... [19:51:07] wmf employees are typically in that group. [19:51:13] You might want to be in it too :-) [19:51:17] look for jallemandou [19:51:40] jallemandou is not in that group either [19:51:47] Arrf :) [19:51:55] I do have +2 now :) [19:52:07] What was the issue ? [19:52:22] The real issue is that you're not in the wmf ldap group. [19:52:24] (CR) Joal: [C: 2 V: 2] Add 0.0.8 refinery jars and update symlinks [analytics/refinery] - https://gerrit.wikimedia.org/r/194931 (owner: QChris) [19:52:28] joal did not know that qchris is THE MASTER of gerrit ta -tachannnnnn [19:52:44] But for now, I added you to the analytics group in gerrit. [19:52:51] * joal still has a lot to learn ! [19:53:08] Gerrit is the devil. Gerrit does not have masters :-) [19:53:50] (CR) Joal: [C: 2 V: 2] Bump refinery version to 0.0.8 for mediacounts [analytics/refinery] - https://gerrit.wikimedia.org/r/194932 (owner: QChris) [19:54:10] You're merged :) [19:54:21] qchris: --^ [19:54:32] joal: Yay! [19:54:34] Thanks. [19:54:35] Awesome. [19:55:55] qchris: andrew told me that deploys for refinery/source were trickier than for refinery ... is that correct ? [19:56:14] Yes. [19:56:39] ok, so you'll have to wait for him for a deploy !
[19:56:48] So for refinery, one "only" needs plain deployment access [19:56:56] k [19:57:03] But for the refinery/source deploy, you'd also need access to archiva. [19:57:09] I guess only ottomata has that. [19:57:13] ok [19:57:22] I mean ... ottomata has ... and all ops can force access :-) [19:57:34] makes sense [19:58:31] YuviPanda: We have a lovely wmf employee that is not yet in the wmf ldap group ... does that need a proper ticket, 3 days wait etc etc, or can he get added right away? [19:59:11] qchris: I think wmf group can be done right away provided someone confirms it’s an employee (wmfall email / email from manager) [20:00:04] His account has an "@wikimedia.org" email address. Is that ok too? [20:00:23] hmm, I’m not sure. [20:00:33] I mean ... he is an employee :-) [20:00:55] Ok. Sorry. I do not want to cut processes short. [20:00:59] We'll file a ticket. [20:01:01] I don’t think there’s a process for this [20:01:07] Sorry for the peer pressure. [20:01:11] hehe [20:01:24] qchris: I think Otto should be able to just add him. [20:01:39] qchris: also, I’m also partly peer-pressuring directors / managers to notify wmfall of newhires… :D [20:02:07] I am sure there was a wmfall email. Let me check if I was still on the list when the email flew by. [20:04:37] Meh. tnegrin's announcement was only to the public analytics list. Not wmfall. [20:05:09] Actually there was: From tnegrin, Feb 19 [20:06:40] joal: aha! [20:06:42] indeed there is [20:06:43] Mhmm ... I guess I unsubscribed before that then. [20:06:46] Cool. [20:07:03] No prob, Thx qchris [20:07:31] joal: alright, so what’s your LDAP name? I’ll add you to the group... [20:07:44] JAllemandou_(WMF) [20:07:49] Nope. [20:07:51] It's joal. [20:07:57] uid=joal,ou=people,dc=wikimedia,dc=org [20:07:57] Really ? [20:08:03] wow ... [20:08:14] Ldap name comes with wikitech then :) [20:08:16] Also sn, cn. [20:08:42] Yes. Ldap is wikitech. [20:08:46] k [20:09:51] Got to go for FOOD !
[20:09:58] See y'all tomorrow :) [20:09:58] Bon Appetit! [20:10:09] Merci :) [20:10:28] nuria: Launched jobs from 2015-01-01 for both daily and monthly [20:10:36] Will check regularly on execution [20:10:48] joal|night: all right, will check in couple hours and report, good nite [20:11:46] qchris: joal|night you have been added to the wmf ldap group [20:12:03] YuviPanda: Awesome! Thanks. [20:12:06] qchris: thanks for poking :) [20:12:07] * qchris hugs YuviPanda [20:45:36] Thx YuviPanda :) [21:05:28] nuria, joal|night: The backfilling jobs around uniques that you started around 20:00 are effectively stalling the cluster. [21:05:34] http://ganglia.wikimedia.org/latest/graph_all_periods.php?hreg[]=analytics1012.eqiad.wmnet|analytics1018.eqiad.wmnet|analytics1021.eqiad.wmnet|analytics1022.eqiad.wmnet&mreg[]=kafka.server.BrokerTopicMetrics.%2B-BytesOutPerSec.OneMinuteRate&z=large&gtype=stack&title=kafka.server.BrokerTopicMetrics.%2B-BytesOutPerSec.OneMinuteRate&aggregate=1&r=hour [21:06:02] They are hogging ~600GB of currently ~900GB of memory for veeeeery long. [21:06:44] oh, THAT'S why the queries are ganked? [21:06:53] I've been staring at a query at 99% map for the last..lots. [21:07:09] qchris: I see, why is that ? because the number of jobs? [21:07:20] qchris: or because of the *monthly jobs [21:07:28] job_1424966181866_17188 [21:07:47] ^ is the job id that wants 568GB of mem. [21:08:14] The total number of jobs is low. [21:08:26] It's just a single big job that is starving everything else. [21:08:38] qchris: lemme look at which one that is [21:08:45] Also ... that job is running in the essential queue. Hence, preempting plain users' jobs. [21:11:52] nuria: This job seems to have finished in the meantime (after starving the cluster forever) [21:12:15] But the next one has already been started: job_1424966181866_17193 [21:13:12] Not even kafka could catch up during the short pause. [21:13:40] qchris: where can you look that up?
I was doing 'oozie job -info' on monthly jobs but that is kind of not the best way [21:13:57] Look at the url that I pasted above. [21:14:15] The top left diagram should show a real hump every 10 minutes [21:14:33] The second diagram should go up and down more pronounced. [21:14:41] Like it did before 20:00. [21:14:57] The flatter they are, the less work kafka can do. [21:15:29] To look at memory consumption of jobs, you can run [21:15:33] mapred job -list [21:15:36] on stat1002. [21:16:27] nuria: qchris Ironholds have you seen jupyter.wmflabs.org [21:16:43] * qchris looks [21:17:51] YuviPanda: i do not see anything (blank page after oauth) [21:17:58] Hey. Is that the IPython notebook thing that you discussed some hours ago? [21:18:17] nuria: yeah, there’s a known bug with any usernames that contain non-alpha numeric characters... [21:18:18] qchris: yup [21:18:51] YuviPanda: ah ok, i used my NRuiz (WMF) [21:19:03] YuviPanda: Neat! [21:19:06] nuria: yeah, i could see. try something without ()? [21:19:12] YuviPanda: ahhh ipython [21:19:58] nuria: yup, but with access to dumps (/public), replica dbs, persistent storage, and security / isolation via docker [21:20:17] YuviPanda: docket is everybody's favorite [21:20:22] *docker [21:20:53] YuviPanda is the crazy guy :-) [21:21:01] I guess people will really, really, really love that! [21:21:06] yup, yup [21:21:16] gonna take a while before it’s really useful. it’s not fully stable yet [21:21:19] I need to puppetize this as well [21:21:24] and also throw more instances at it [21:21:30] and have some way of scheduling dockers across [21:24:20] YuviPanda: "To use Connected Apps on this site, you must have an account across all projects. When you have an account on all projects, you can try to connect "Jupyter Hub" again." [21:24:25] YuviPanda: oh well [21:24:29] nuria: oh wow...
[21:24:39] nuria: I’ll fix the bug with the usernames tomorrow [21:24:58] YuviPanda: ok, will make sure to look [21:25:15] qchris: looking at graphs [21:26:11] qchris: I see ...so.. how can we do this better? [21:26:26] qchris: the "monthly" jobs look at all refined data for 1 month [21:27:06] Sorry to say ... but uniques is not something I want to dive into. [21:27:14] But in general there are three approaches. [21:27:21] aham [21:27:23] 1. Get more hardware resources. [21:27:38] right [21:27:42] 2. Split the query, and run queries that consume less memory [21:27:58] 3. Make the query faster, so it does not hold onto that much memory for that long. [21:28:29] There are currently 5 unhealthy nodes. Reclaiming them will give us 200GB more RAM. That will help. [21:28:34] But it's not a solution. [21:28:35] qchris: in test, monthly job didn't stall the cluster ... [21:29:06] joal|night: That might be. But it seems it's doing so now: http://ganglia.wikimedia.org/latest/graph_all_periods.php?hreg[]=analytics1012.eqiad.wmnet|analytics1018.eqiad.wmnet|analytics1021.eqiad.wmnet|analytics1022.eqiad.wmnet&mreg[]=kafka.server.BrokerTopicMetrics.%2B-BytesOutPerSec.OneMinuteRate&z=large&gtype=stack&title=kafka.server.BrokerTopicMetrics.%2B-BytesOutPerSec.OneMinuteRate&aggregate=1&r=hour [21:29:21] qchris My guess is that the change we made with nuria has changed the game [21:29:31] Yeah, I've seen [21:29:38] From my point of view opetion 2 (splitting the query) is typically the easiest. [21:29:46] s/opetion/option/ [21:30:03] I've done the same in the past, and now again for the mediacounts files. [21:30:06] joal|night: the removal of "section"? .. mmm i doubt it [21:30:25] joal|night: cause all apps requests put together i do not think they come up to 2% of total traffic [21:30:30] For the mediacounts files, memory consumption decreased a lot. Really a lot. And runtime more than halved. [21:30:50] qchris: splitting across partition boundaries? [21:31:07] For mediacounts: Yes.
[21:31:07] qchris: Like "run half on 1 month" union "the other half" [21:31:20] Not sure if that is that easily possible for uniques. [21:31:26] qchris: yes [21:31:37] There are ways around that ... but since it's uniques ... I do not want to think about it. [21:31:43] :D [21:31:44] qchris but what i do not get... is why that would lower memory consumption [21:32:07] qchris: isn't memory driven by number of records/blocks loaded? [21:32:40] Well ... in splitting, one typically turns (one big query) into: [21:33:03] (split and then sort and reduce on small part of data), (split and then sort and reduce on small part of data), (split and then sort and reduce on small part of data) [21:33:15] Followed by a sort/group of those smaller chunks. [21:33:29] That's where the smaller footprint comes from. [21:33:45] One reduces the data size in the intermediate steps. [21:35:40] qchris: ok, this is something we have to look into i guess [21:35:47] I think the "section" clause added a proper reduction by naturally reducing the number of non-distinct lines, facilitating the distinct part of the query [21:36:21] Could that be ? [21:36:24] joal|night: but that number -overall- is tiny compared to the partition size [21:36:37] joal|night: the actual dataset for apps is real, real small [21:38:14] joal|night: Soooo.. not sure.. my inclination would be to say that the removal cannot be the game changer but I might be totally off [21:39:39] Be that as it may ... the jobs blocked the cluster for 1.5 hours now ... [21:39:44] Is it ok if I kill those jobs? [21:39:54] qchris: no, not really [21:40:07] :-d [21:40:21] s/:-d/:-D/ [21:40:53] qchris: cause we need the data for the mobile team (i know, your favorite chunk of data) [21:41:13] Well ... improve your queries then. [21:41:16] qchris: what is the worst case scenario, one monthly job just finished [21:41:24] qchris: and the other is wip right? [21:41:25] I do not want to have to backfill everything again.
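qchris' "split, reduce small parts, then combine" idea above can be illustrated with a hedged HiveQL sketch; table, column, and partition names are illustrative only, not the real uniques jobs. Each half-month is aggregated independently, so the intermediate data that has to be sorted and held in memory is far smaller than in one query over the full month:

```sql
-- Illustrative only: names and partition values are assumptions.
SELECT os, browser, SUM(cnt) AS pageview_count
FROM (
  SELECT os, browser, COUNT(*) AS cnt
  FROM wmf.webrequest
  WHERE year = 2015 AND month = 2 AND day <= 14
  GROUP BY os, browser
  UNION ALL
  SELECT os, browser, COUNT(*) AS cnt
  FROM wmf.webrequest
  WHERE year = 2015 AND month = 2 AND day > 14
  GROUP BY os, browser
) halves
GROUP BY os, browser;
```

Note this only works directly for decomposable aggregates like COUNT(*) or SUM; COUNT(DISTINCT ...) does not split this way without extra machinery, which is why qchris hedges about applying it to the uniques jobs.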
qchris: meaning the refined tables? [21:41:44] yup. [21:42:02] qchris: ay , sorry i do not see how these two are related, why does this job have priority over the other? [21:42:11] qchris: that should not be the case [21:42:11] We can pause the jobs, right? [21:42:41] Ironholds: Now would be a good time to chime in :-) [21:42:46] qchris: isn't the top priority job the one that fills in "refined" tables? [21:43:07] Not if the uniques jobs run in the same queue. [21:43:10] But it's ok. [21:43:16] qchris, sorry; what did I miss? *reads up* [21:43:16] I tried to make my point. [21:43:22] qchris: But i thought ottomata just changed that [21:43:36] Ironholds: You missed that people think their queries are more important than others. [21:43:42] qchris: if they are interfering with refined tables, by all means kill them [21:43:44] agreed qchris and nuria, uniques jobs should probably not be in the essential queue [21:43:47] heh [21:43:52] so, on the point about sections=0 [21:44:01] if there's a performance increase from having SOME kind of filter there... [21:44:04] * Ironholds grabs his def [21:44:09] qchris: I assumed they were in the priority #2 queue after otto's changes [21:44:30] Well. I'll let you settle those things. [21:44:31] Ironholds: no, we do not know that, we never run two monthly jobs at the same time [21:44:32] Ironholds: Nope, no diff [21:44:36] Just checked [21:44:40] okies! [21:44:42] You know the cluster is stalled by those jobs. [21:44:50] I only wanted to help. [21:44:57] qchris: Please do kill them [21:44:57] so, UUID + ?? == stall? Crap. [21:45:05] qchris: if they interfere with backfilling [21:45:21] qchris: We did not know that was the case (at least i did not ) [21:45:34] I think everything is back to normal now [21:45:52] on 3. do we want to move more towards raw mapreduce jobs rather than hive queries for 'infrastructure' data generation? [21:45:56] The concern will be when the Feb month kicks in ...
[21:46:11] i.e., could we eke out efficiencies by avoiding hive's idiosyncrasies? [21:46:32] (I would also like more resourcing and I think we're getting more resourcing, but if we can solve it without 'first, add more machines', great. Machines are expensive and take time to order.) [21:46:58] joal|night: I do not think that things are back to normal already now (without killing). ... The unique jobs still hold the memory and starve the others. [21:47:12] kafka is not picking up either. [21:47:17] e.g.: we could use a streaming model so the only real memory usage is the ongoing grabbing of uniques, and a row at a time? [21:47:26] as Bob West did with the request-chaining stuff? [21:47:41] joal|night, qchris: then ... let's just kill them [21:47:55] k. will kill them. [21:48:09] Ironholds: I think we first need to fix priorities so "business" jobs [21:48:24] *nods* [21:48:30] Ironholds: come after the jobs that fill refined tables. [21:48:34] agreed! [21:48:39] * qchris killed the two jobs [21:48:49] qchris, my job completed! Thank you! :D [21:48:58] qchris: the two monthly ones, right? [21:48:59] only 5.6m rows. Neat. [21:49:04] memory consumption back to normal. [21:49:27] kafka is picking up too. [21:49:46] nuria: I killed 0022831-150220163729023-oozie-oozi-C, and 0022832-150220163729023-oozie-oozi-C [21:50:16] qchris, joal|night : ok, let's revisit Monday how we want to deal with these, [21:50:33] k [21:50:34] k. [21:50:46] Thx qchris [21:51:14] How can we see when kafka starts to suffer ? [21:51:40] joal|night: I use the ganglia graphs. (The URL that I pasted above) [21:51:43] qchris: do we have alarms about this? [21:51:46] That's the easiest. [21:52:12] But you can also navigate to those graphs in ganglia ... or watch hdfs if new files get added in time. [21:52:27] nuria: I do not think so. [21:52:39] nuria: At least I am not aware of such alarms. [21:53:23] See you guys on monday :) [21:53:29] Enjoy your weekend!
[21:53:46] qchris: thanks for your work chjris [21:53:53] yw. [21:54:11] sorry qchris [21:55:24] me goes back to the joy of vcl and cookies [21:55:43] :-) [22:09:20] !log starting HDFS balance for unhealthy node analytics1016.eqiad.wmnet with healthy nodes analytics1037.eqiad.wmnet,analytics1040.eqiad.wmnet [22:25:51] inbox 0 \o/ [22:33:34] * qchris bows to Ironholds! [22:44:39] qchris, how much do you know about how requestlog entries are transmitted to the Hadoop store through kafka? ;) [22:45:12] Not too much. Only know the things on the surface. [22:45:19] But what's the question? [22:45:49] will poke in PM [22:46:03] Whoooooo... Secrecy. Yeah! [22:46:04] :-D