[00:00:32] hey yurik :-) [00:00:43] qchris, hi! [00:00:57] plane in 20, what's up? [00:01:01] the cluster is currently working hard to catch up what it missied during the day [00:01:11] can I kill your zero-sms job for today? [00:01:20] sec [00:01:43] (It's hard-blocking the catchup, as the jobs are fighting for resources, and currently there is a stale mate) [00:01:47] (all jobs blocked) [00:02:08] qchris, sure, go ahead - i was worried that it hasn't finished yet [00:02:19] qchris, i might have to bug you tomorrow to possibly manually restart it ;) [00:02:20] k. thanks. [00:02:44] k. [00:04:12] qchris, did you kill the python script that starts them? [00:04:21] killing myself ... [00:04:22] no. [00:04:36] the hive job just released the resources it held. [00:05:04] (Cluster is slowly recovering) [00:05:33] qchris, just in case - i have a 48 */4 * * * job that starts it [00:05:55] And it automatically re-runs missing periods? [00:06:03] it runs python script that spawns a hive job per each missing day [00:06:14] Awesome! [00:06:16] Thanks. [00:06:19] np [00:06:40] qchris, double check - it could have started another job when you killed the first [00:07:00] I did not yet kill a job. [00:07:05] I only asked. [00:07:15] But before I could kill it ... it freed the resources. [00:07:15] ah, ok - i killed the python script anyway [00:07:23] it got scared [00:07:27] :-D [02:04:55] * Fiona smiles at https://wikimediafoundation.org/wiki/File:Andreescu,_Dan_January_2015.jpg [10:10:18] Analytics-Tech-community-metrics, ECT-February-2015, ECT-March-2015: Key performance indicator: Gerrit review queue - https://phabricator.wikimedia.org/T39463#1069401 (Qgil) [14:09:20] Analytics-Tech-community-metrics, Wikimedia-Git-or-Gerrit, ECT-February-2015: Active code review users on a monthly basis - https://phabricator.wikimedia.org/T86152#1069777 (Nemo_bis) > Why? Self-merges are not code review. The bug summary says "code review". If you're not interested in code review,... 
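The backfill described at 00:05-00:06 above (a `48 */4 * * *` cron entry running a Python script that spawns one Hive job per missing day) could look roughly like the sketch below. The function name, the set of already-processed days, and the print in place of a job launch are assumptions for illustration; the actual zero-sms script is not shown in this log.

```python
from datetime import date, timedelta


def missing_days(start, end, completed):
    """Return every day in [start, end] that is not in `completed`, oldest first."""
    days = []
    current = start
    while current <= end:
        if current not in completed:
            days.append(current)
        current += timedelta(days=1)
    return days


if __name__ == "__main__":
    # Hypothetical state: two of four days already have results.
    done = {date(2015, 2, 23), date(2015, 2, 25)}
    for day in missing_days(date(2015, 2, 23), date(2015, 2, 26), done):
        # A real script would launch one Hive job per missing day here.
        print("would spawn a job for", day.isoformat())
```

Because each run recomputes the list of missing days, killing one job only delays it until the next cron run picks up the same gap.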
[14:10:22] Hi ottomata, yt ? [14:10:28] hiya [14:10:57] You ok this morning ? [14:11:15] 'cause I have code review for yaaa ;) [14:11:19] doing alrigggghhhht! :) [14:11:28] qchris and I are petting the cluster, telling it that everything will be alright [14:11:38] huhu [14:12:04] I don't wanna bother qchris, but I'd like to know how he monitors the fact that our jobs ar late [14:14:40] also, working on the mobile monthly uniques and therefore reading on oozie, I'd like to discuss with you the use of forward/backward frequency counts for datasets in coordinators [14:15:15] sureuuuU!U!! [14:15:22] he monitors because he has datasets he cares about [14:15:24] so he noticed. :) [14:16:04] e.g.: shall we use ${coord:current(0)}-${coord:current(23)} OR ${coord:current(-23)}-${coord:current(0)} [14:17:09] i think the former, since we want the nominal time of the workflow to be the day for which the data is [14:17:10] that is. [14:17:23] Feb 26 hours 0-23 [14:17:30] should generate the fiel for [14:17:32] FEb 6 [14:17:35] Feb 26 [14:17:35] * [14:17:44] (typing hard) [14:18:30] joal: About seeing which datasets are late ... you can run '/srv/deployment/analytics/refinery/bin/refinery-dump-status-webrequest-partitions --datasets all' on stat1002. Once things look wrong there, I typically just look at the oozie output directly. [14:18:52] (as long as someone keeps that file up to date with each dataset...:p) [14:19:15] i think i'm going to revert the vcores change [14:19:30] Thx qchris :) [14:19:31] the change happened the night before the cluster started backing up [14:19:34] I'll have allok [14:20:04] ottomata: so about the cluster deadlock. [14:20:08] ja [14:20:11] It also happened before. [14:20:17] Like once a month or something. [14:20:25] really? and you just managed it and didnt' tell me?! :p [14:21:09] It typically was some random query from some researcher that was too big. 
So I freed up resources for that and once that query finished backfilled the other stuff automatically. [14:21:33] We discussed memory limitations a few times. [14:22:00] But the issue wasn't as big as it is now, because back then, everything was based on wmf_raw. [14:22:09] ? [14:22:12] So things blocked early, and recovery was easy. [14:22:16] ah [14:22:29] Now with wmf.webrequest, one has to be carefull when to run which job, [14:22:39] because more jobs can run at once? [14:22:40] as otherwise, the refining happens on partial data. [14:22:44] right. [14:22:49] ? [14:22:55] O [14:22:56] H [14:23:02] because 2 hours would pass [14:23:07] and camus wouldn,'t be done. [14:23:09] Right. [14:23:11] that's bad. [14:23:12] hm [14:23:12] ok [14:23:25] hm. [14:23:40] ok, before I try to revert vcores then [14:23:44] i'm going to do some fairscheduler reading [14:23:49] and tweak the queues [14:23:54] cool. [14:23:58] i could just up the priority of essential [14:24:00] but i'll read a bit first [14:29:03] Analytics-Tech-community-metrics, Wikimedia-Git-or-Gerrit, ECT-February-2015: Active Gerrit users on a monthly basis - https://phabricator.wikimedia.org/T86152#1069783 (Qgil) [14:58:16] ottomata: how are things coming along with the queue tuning? Should I start refining now nontheless, so we get a few partitions done until you finished your parameter testing? [14:59:00] sure [14:59:02] they are coming along [14:59:14] actually, fair-scheduler allocations will get picked up wtihout restarting resource manager [14:59:17] i but i want to enable preemtpion [14:59:25] yay for preemption! 
[14:59:37] (refining started again) [15:15:29] Analytics-Cluster, Analytics-Kanban: Add 'version' field to refined webrequest table in Hive - https://phabricator.wikimedia.org/T90725#1069884 (ggellerman) Open>Resolved a:JAllemandou [15:18:02] Analytics-Cluster, Analytics-Kanban: Add jar versions as parameters in oozie jobs - https://phabricator.wikimedia.org/T90736#1069899 (JAllemandou) a:JAllemandou [15:18:45] Analytics-Cluster, Analytics-Kanban: Add jar versions as parameters in oozie jobs - https://phabricator.wikimedia.org/T90736#1066190 (JAllemandou) p:Triage>Normal [15:21:00] qchris: would you mind looking this over? [15:21:02] you can say no :) [15:21:02] https://gerrit.wikimedia.org/r/#/c/193109/ [15:21:07] * qchris looks [15:24:15] ottomata: I have to admit that I do not fully understand what the parameters mean. Your comments make sense though. [15:24:41] Since we announced that the cluster has issues ... should we just try those settings in production? [15:28:15] yes i think so [15:28:18] can't hurt :p [15:28:34] so I 90% grasp the params [15:28:43] preemption will be the most useful one here [15:28:54] yup. [15:28:57] there is a min share that an app gets from a queue, and a fair share. [15:29:06] this says that, if an app doesn't get its min share after 60 seconds [15:29:17] other containers (not jobs) from other queues will be killed [15:29:27] then, after 10 minutes, if a job doesn't get its fair share [15:29:30] it will start killing more containers [15:29:42] Yup. Those look good. 
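The two preemption timeouts explained at 15:28-15:29 can be pictured with the small sketch below. It only models the rule as stated in the chat (60 seconds without min share, 10 minutes without fair share). The function and constant names are made up for illustration, and this is not how the fair scheduler itself is implemented.

```python
MIN_SHARE_TIMEOUT_S = 60     # from the chat: min share not met after 60 seconds
FAIR_SHARE_TIMEOUT_S = 600   # from the chat: fair share not met after 10 minutes


def preemption_target(starved_min_s, starved_fair_s):
    """Return which share (if any) containers may be preempted to restore.

    starved_min_s:  seconds the app has been below its min share
    starved_fair_s: seconds the app has been below its fair share
    """
    if starved_fair_s >= FAIR_SHARE_TIMEOUT_S:
        # Longest wait exceeded: preempt enough to reach the fair share.
        return "fair_share"
    if starved_min_s >= MIN_SHARE_TIMEOUT_S:
        # Shorter wait exceeded: preempt only enough to reach the min share.
        return "min_share"
    return None


if __name__ == "__main__":
    for waited in (30, 90, 700):
        print(waited, "seconds starved ->", preemption_target(waited, waited))
```

In both cases it is individual containers in other queues that get killed, not whole jobs, as noted in the chat.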
[15:29:44] this one [15:29:44] yarn.scheduler.fair.locality.threshold.node [15:29:49] But I do not fully grasp [15:29:51] yarn.scheduler.fair.locality.threshold.node [15:29:53] right :-) [15:30:01] makes yarn delay a bit when first scheduling jobs [15:30:10] yarn wants to schedule jobs with data locality [15:30:21] so, in a cluster with nothing running [15:30:29] it will schedule jobs to run where the data that job needs is [15:30:33] but, if a cluster is busy [15:30:41] the node where the data is might be full [15:31:00] so, it will just schedule the job on whichever node is available [15:31:12] which means that the job will have to get the data over the network [15:31:28] Sure. But 1/3 ... seems big. We have jobs with 1500 mappers. [15:31:42] so, it only means [15:31:55] So wouldn't that mean that 500 maps would have to get scheduled locally before the job would start? [15:32:10] that it will wait until it has had opportunities to schedule the job on 1/3 of the cluster before it will schedule something non local [15:32:12] no [15:32:27] it just tells it to not immediately schedule a non local job [15:32:39] it says: wait a bit, maybe a local node will be available elsewhere [15:32:49] it waits until it has a chance to try 1/3 of the cluster [15:32:54] before it schedules non local [15:32:57] Curious to see that int production and make the cluster go zooooooooooooooooom :-) [15:33:07] i think anyway [15:33:09] see [15:33:10] Delay Scheduling [15:33:11] here: [15:33:13] https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781491901687/ch04.html [15:33:25] Cool. Thanks for the link. [15:34:04] also, correct me if I misunderstood that [15:34:05] :) [15:48:06] ok, qchris, cluster looks pretty empty right now [15:48:09] one of ellery's jobs is running [15:48:16] it just started [15:48:19] i want to restart resource manager [15:48:48] gonna just do it. 
:) [15:48:50] wait [15:48:55] oh [15:48:58] hm [15:49:04] ohi think i am looking at the wrong queue [15:49:11] there are lots of hdfs jobs running too [15:49:29] ah yeah, ah, i didn't realize i could filter on queue in the scheduler ui [15:49:58] will just wait until this camus job is done, then restart, hopefully the ohter jobs will jstu go, if they don't, we can restart them in oozie [15:50:25] Ignore the hdfs jobs for now :-) [15:50:32] Those are mostly refining. [15:50:34] And some camus. [15:50:43] So restart the resource manager at will. [15:50:59] ottomata: ^ [15:55:07] this camus is almost done [15:55:15] also, um, analytlics1011 looks like i has problems? [15:55:17] looking into it: [15:55:24] http://localhost:8088/cluster/nodes [15:55:26] 9/12 local-dirs are bad: [15:55:39] I'd ignore the camus one. It'll just catch up when needed. [15:56:28] !log restarted resourcemanager on analytics1001 to load new fairscheduler settings [15:58:09] would be nice to have HA resourcemanager :/ [15:58:14] :-) [15:58:17] alljobs just got kablaamed. [15:58:23] Application with id 'application_1424120984454_20321' doesn't exist in RM. :( [15:58:32] I am about to restart the jobs. [15:58:36] cool [15:58:36] danke [16:09:04] Analytics-Tech-community-metrics, Phabricator, ECT-February-2015, ECT-March-2015: Metrics for Maniphest - https://phabricator.wikimedia.org/T28#1070074 (Aklapper) Though I played a bit more with Phab's SQL in the last days (phun phun phun!) I won't get into this task in Feb 2015 (nothing done on the... 
[16:35:47] Analytics-Cluster, Analytics-Kanban, Scrum-of-Scrums: Create Daily & Monthly pageview dump with country data - https://phabricator.wikimedia.org/T90759#1070191 (kevinator) p:Triage>Normal [16:37:09] Analytics-Cluster, Analytics-Kanban, Scrum-of-Scrums: Create Daily & Monthly pageview dump with country data - https://phabricator.wikimedia.org/T90759#1066753 (kevinator) [16:39:28] Analytics-Cluster, Analytics-Kanban: Refactor MobileApps uniques HQL to use external table to format data. - https://phabricator.wikimedia.org/T90730#1070204 (kevinator) p:Triage>Normal [16:39:53] btw, I just changed the table: https://wikitech.wikimedia.org/wiki/Analytics/Unique_clients/Last_visit_solution#Deliverable [16:40:01] to provide a visual understanding of what we're saying below [16:40:37] gaahhh [16:40:40] analytics1011 [16:40:44] why you unhealthy? [16:40:50] is it because your disks are at 91% utilization? [16:44:53] Analytics-Cluster, Analytics-Kanban: Refactor MobileApps uniques HQL to use external table to format data [5 pts] - https://phabricator.wikimedia.org/T90730#1070237 (kevinator) [16:46:16] qchris, joal, fyi, I just started a limited balancer job [16:46:39] analytics1011 has a lot of data on it, and i think this is why it isn't showing up for yarn usage [16:46:39] What does that do? [16:46:53] mhmm. ok. [16:46:54] i picked 3 other nodes to consider [16:47:06] Analytics-Cluster, Analytics-Kanban: Refactor MobileApps uniques HQL to use external table to format data [5 pts] - https://phabricator.wikimedia.org/T90730#1070250 (Nuria) - refactor mobile app uniques daily job to take advantage of kind of a "temporary" table (real external table and drop it) on hiv... [16:47:08] it will rebalacne blocks between analytics1011 + the other 3 nodes [16:47:16] until each are about withing 10% utilization of each other [16:47:21] cool. 
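The stopping condition for the limited balancer run, as worded at 16:47 ("until each are about within 10% utilization of each other"), can be sketched as below. This follows the chat's wording only; the real HDFS balancer has its own definition of its threshold. The names `node-a` to `node-c` are hypothetical stand-ins for the three unnamed nodes, and the 91% figure for analytics1011 is the disk utilization mentioned at 16:40.

```python
def is_balanced(utilization, threshold_pct=10.0):
    """True when the highest and lowest utilization differ by at most threshold_pct points.

    `utilization` maps node name -> disk utilization in percent and must not be empty.
    """
    values = list(utilization.values())
    return max(values) - min(values) <= threshold_pct


if __name__ == "__main__":
    before = {"analytics1011": 91.0, "node-a": 40.0, "node-b": 41.0, "node-c": 39.0}
    print("balanced before the run:", is_balanced(before))
```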
[16:47:28] i picked 3 nodes that had about 40% util [16:47:32] i could do the whole cluster [16:47:35] which might not be a bad idea [16:47:45] Btw. After the resource manager restart, the cluster is pretty snappy. [16:47:49] but, i'll let this go first, just to clear up analytics1011 [16:47:53] ha, cool [16:47:57] It can take more load than yesterday. [16:48:06] hm [16:48:07] Not sure yet whether or not it can take the full load again. [16:48:12] not sure why that would be though [16:48:23] i mean, it should prefer the essential jobs now much more [16:48:25] that's all I changed [16:48:37] maybe it just has less running cause we killed stuff [16:48:55] Before the restart, I could not refine all five webrequest_sources at once. [16:48:58] Now I can. [16:49:19] i did not explicitly kill stuff and and resources look comparable. [16:50:00] hm [16:50:01] weird [16:50:05] yup. [16:50:07] well, balacner could slow it down a bit :/ [16:50:08] Analytics-Cluster, Analytics-Kanban: Update documentation page for the refined webrequest table in hive - https://phabricator.wikimedia.org/T90726#1070263 (kevinator) p:Triage>Normal a:JAllemandou [16:50:12] but tha's why i limited it to a few nodes [16:50:14] rather than all of them [16:50:15] oh! [16:50:20] qchris, i was not aware of the data category [16:50:27] i have a bunch of emails yet to respond to today... [16:50:32] :-) [16:50:39] joal: we should just use this [16:50:40] https://wikitech.wikimedia.org/wiki/Category:Data_stream [16:50:44] rather than our Data hierarchy [16:51:23] Soujnds god to me [16:51:24] and/or we could move these pages into Data/ [16:51:31] As you prefer [16:51:34] but adding the category sounds like a better way to orgainize... 
[16:51:40] Analytics-Cluster, Analytics-Kanban: Update documentation page for the refined webrequest table in hive - https://phabricator.wikimedia.org/T90726#1065940 (kevinator) [16:51:41] i am a noob wiki user [16:51:41] Analytics-Cluster, Analytics-Kanban: Add 'version' field to refined webrequest table in Hive - https://phabricator.wikimedia.org/T90725#1070268 (kevinator) [16:51:45] even though i have worked here for 3 years [16:52:13] when's your anniversary?! [16:52:23] (PS1) Jsahleen: Add min and uz to for beta features dashboard. [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/193128 [16:53:41] (CR) Jsahleen: [C: 2] "Simple change." [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/193128 (owner: Jsahleen) [16:54:47] (CR) Ottomata: "Overall, I like." [analytics/refinery] - https://gerrit.wikimedia.org/r/192891 (owner: Joal) [16:56:06] milimetric: We added some new languages for our beta enablements dashboard. Changes have been merged. Can you push to the server? [16:56:22] sure jsahleen, one sec [16:57:23] jsahleen: done, but any data that needs to recompute won't be there for a bit [16:57:28] wow, qchris, a single mediacount day is huge, eh? [16:57:35] (as it waits for crons / rsyncs / etc.) [16:57:39] milimetric: uncertain [16:57:40] Just 300MB (compressed) [16:57:42] depends on when you start counting [16:57:52] i was 50% for a few months when i started [16:57:52] milimetric: Thanks! [16:57:59] i think i count from when I was 100% [16:58:02] which is probably like april something [16:58:12] ooh! coming up. You should write an ori - style email [16:58:17] ha [16:58:28] qchris: its huge! one day of aggregates is 300M compressed! [16:58:40] 300MB is huge? [16:58:45] for an aggregate, no? [16:58:48] pagecounts is 100MB/hour. [16:58:53] haha, is it really? [16:58:53] haha [16:59:00] :-) [16:59:05] its fine for sure, i'm just surprised [16:59:10] i guess there are a lot of images! 
[16:59:33] Analytics-Cluster, Analytics-Kanban, Easy: Mobile Apps PM has monthly report from oozie about apps uniques [8 pts] - https://phabricator.wikimedia.org/T88308#1070303 (kevinator) [16:59:35] There are :-D And videos, and audio files ... and images of math formulae :-D [16:59:41] ori-style email? [16:59:44] * ori blinks [17:00:21] ori: your awesome "I've been here for 2 years" one [17:00:26] Analytics-Cluster, Analytics-Kanban: Refactor MobileApps uniques HQL to use external table to format data [8 pts] - https://phabricator.wikimedia.org/T90730#1070307 (kevinator) [17:00:50] that was a great email, I starred it [17:01:18] Analytics-EventLogging, Analytics-Kanban: Investigate EventLogging Monitoring with Ops DBA - https://phabricator.wikimedia.org/T86200#1070311 (kevinator) This is blocked until we change the DB. [17:01:31] heh [17:05:59] Analytics: Upgrade daily/monthly aggregations of pageview dumps to new data files - https://phabricator.wikimedia.org/T90203#1070325 (Ottomata) Hm, unless it is very easy, I think you should hold off from making this change. pagecounts-all-sites uses the same pageview definitino as pagecounts-raw, except th... [17:10:17] qchris: holaaaaa, can i ask you a question? [17:10:28] !ask | nuria [17:10:28] nuria: Please feel free to ask your question: if anybody who knows the answer is around, they will surely reply. Don't ask for help or for attention before actually asking your question, that's just a waste of time---both yours and everybody else's. :) [17:10:34] qchris: jaja [17:10:41] ;-) [17:11:00] so what's your question? [17:11:01] qchris: rememeber the agreggator github depot that holds pageviews that are later shown in dashiki? [17:11:09] Analytics-Kanban: Analyze device class(mobile/desktop) and how it influences Edit Schema events {lion} - https://phabricator.wikimedia.org/T89728#1070344 (kevinator) a:Jdforrester-WMF [17:11:10] yup. [17:11:26] I guess it stalled for the last two days or so. 
[17:11:39] (Due to the cluster not producing all needed data) [17:11:40] qchriS; the permits on the files are such that apache now cannnot read them [17:12:11] https://www.irccloud.com/pastebin/Hy8APUqI [17:12:15] Analytics-Cluster, Analytics-Kanban: Estimate roughly of how many users might not have javascript capable/enable browsers, use CSS to crosscheck. - https://phabricator.wikimedia.org/T89847#1070351 (kevinator) p:Normal>Low [17:12:24] * qchris looks [17:12:56] No "x" for the directory. [17:13:17] but wait ... daily_temp ... [17:13:33] That does not look like a clone of the repo. [17:13:59] qchris: ah sorry, i think i pasted the wrong one, let me ssh [17:15:47] qchris: actually that is what i see on the machine now. I can reclone but for my life [17:16:08] i could not find the cron jobs that update this [17:16:25] Let me double-check on git.wikimedia.org [17:16:33] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Reliable scheduler computes Visual Editor metrics [21 pts] {lion} - https://phabricator.wikimedia.org/T89251#1070361 (kevinator) [17:16:34] Analytics-EventLogging, Analytics-Kanban: Reliable scheduler collects Visual Editor deployments {lion} - https://phabricator.wikimedia.org/T89253#1070360 (kevinator) [17:17:05] https://git.wikimedia.org/tree/analytics%2Faggregator%2Fdata.git/3edec13e98dbbd14852aca7a5e0a0b5215af9ec1 [17:17:16] nuria: ^ is what the clone should look like [17:18:21] qchris: ok, will change, see permits on daily dir: drwxr-xr-x [17:18:44] If you've got that puppetized, you'd need to update that in puppet [17:18:45] qchris: i think we need the last x to be 'rx' [17:19:12] (the git::clone should provide an owner IIRC) [17:19:35] Yes, 'r-x' is the thing you want. [17:20:14] qchris: ok, so i did not find it on puppet but it is because it might not be there ... ahahajam [17:20:23] k :-D [17:20:36] Then chmod is your friend. [17:20:58] qchris: I swear i tried chmod with root and I could not change it ... 
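The permission problem discussed above (apache needs `r-x` for "others" on the directory, as in `drwxr-xr-x`) can be checked with a few lines of Python. This is a minimal sketch, not something from the aggregator repository, and the helper name is made up.

```python
import stat


def others_can_list_and_traverse(mode):
    """True when the 'other' permission class has both read and execute (r-x)."""
    needed = stat.S_IROTH | stat.S_IXOTH
    return mode & needed == needed


if __name__ == "__main__":
    for mode in (0o755, 0o754, 0o750):
        # 0o040000 marks the mode as a directory so filemode() prints a leading 'd'.
        print(stat.filemode(0o040000 | mode), others_can_list_and_traverse(mode))
```

For a real directory the mode would come from `os.stat(path).st_mode`. Read without execute lets a process list names but not open anything inside, which is why the missing "x" matters.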
[17:21:11] qchris: and then i was like .... what????? [17:21:28] Mhmm. Root should be able to do that. [17:21:36] What machine is that on? [17:21:47] qchris: wikimetrics1 [17:23:33] I tried [17:23:34] chmod o+r daily [17:23:41] as root in /srv/aggregator-data/projectcounts [17:23:42] and it worked? [17:23:46] and it worked. yes. [17:23:48] Weird. [17:23:55] man , ok, last nite retardation [17:24:25] But someone screwed up that repo. [17:24:36] It has untracked files. [17:24:38] Analytics-Cluster, Analytics-Kanban: Epic: qchris transition - https://phabricator.wikimedia.org/T86135#1070373 (kevinator) [17:24:42] ok, will fix, [17:24:57] my last question is where is the cron that updates It? [17:25:02] k. If you run into troubles with that, let me know. [17:25:08] No clue about the cron. [17:25:11] Let me look. [17:25:40] Not sure ... is there a cron that updates it? [17:25:56] seems to be by looking at git log [17:26:09] https://www.irccloud.com/pastebin/DkiUgkuO [17:26:17] k. [17:26:27] Are you running puppet on that host? [17:26:44] (Because puppet would update it IIRC) [17:27:11] Analytics-EventLogging, Analytics-Kanban: Reliable scheduler collects Visual Editor deployments [8 pts] {lion} - https://phabricator.wikimedia.org/T89253#1070388 (kevinator) [17:27:26] Yup. Seems like puppet is running. [17:27:45] Logs show things like: [17:27:48] Feb 26 07:59:07 wikimetrics1 puppet-agent[6804]: (/Stage[main]/Role::Wikimetrics/Git::Clone[aggregator_data]/Exec[git_pull_aggregator_data]/returns) executed successfully [17:28:02] So puppet is doing the repo updating for you. [17:28:04] nuria: ^ [17:28:32] qchris: Ok, will triple check today , fix repo and change permits, thank you. [17:28:38] Analytics-EventLogging, Analytics-Kanban: Reliable scheduler collects Visual Editor deployments [8 pts] {lion} - https://phabricator.wikimedia.org/T89253#1070391 (Milimetric) We can be relatively confident that the current git command gets a good approximation of "deployment". 
It will be a few days behin... [17:29:22] Relevant file in the puppet repo is manifests/role/wikimetrics.pp . [17:29:38] There is an "ensure => latest" that does the trick. [17:29:54] line 316. [17:29:56] yw. [17:31:12] Analytics-Engineering: udp2log: Announce new stream so people can compare streams - https://phabricator.wikimedia.org/T86205#1070397 (kevinator) @ottomata is working on turning off udp2log [17:35:13] Analytics-Cluster, Analytics-Kanban: Epic: qchris transition - https://phabricator.wikimedia.org/T86135#1070410 (kevinator) [17:36:19] Analytics, Analytics-Kanban: udp2log: Announce new stream so people can compare streams - https://phabricator.wikimedia.org/T86205#1070412 (kevinator) [17:38:20] qchris_away: http://dumps.wikimedia.org/other/mediacounts/daily/2015/ :) [17:38:44] ottomata: YOU THA BEST! \o/ [17:38:56] naw, YOU DA BEST [17:39:02] Now I only need to write some docs for mediacounts. [17:39:47] As reward ... I'll grab some food. (Cluster still catching up. But now catching up blazingly fast.) [17:39:51] :) [17:39:56] weird! [17:39:57] but good! [17:40:01] Hey ottomata, about jar verion [17:40:19] hmm, qchris_away [17:40:25] joal, ja? [17:40:34] so, i'm looking at one of the raw hours [17:40:42] 2015-02-25T21:00 [17:40:45] in sequence stats [17:41:15] Na, nothing ... sorry to bother [17:41:17] looks like a lot of missing data [17:41:27] ~50% for some hosts [17:41:42] i wonder if we might want to go back and reset camus and reload from kafka... [17:41:47] joal, naw, man, what's up? [17:44:33] Re-read your comment, found out that we were going for the same name :) [17:44:41] review comming [17:45:01] (PS2) Joal: Add refinery_jar_version as a parameter in oozie bundle properties. [analytics/refinery] - https://gerrit.wikimedia.org/r/192891 [17:45:05] :) [17:45:43] joal, looks good, i think you missed the change in bundle.properties! :p[ [17:45:57] NOWAYYYYYY ! 
[17:46:10] Sorry :) [17:46:19] np [17:46:45] (PS3) Joal: Add refinery_jar_version as a parameter in oozie bundle properties. [analytics/refinery] - https://gerrit.wikimedia.org/r/192891 [17:47:20] (CR) Ottomata: [C: 2 V: 2] Add refinery_jar_version as a parameter in oozie bundle properties. [analytics/refinery] - https://gerrit.wikimedia.org/r/192891 (owner: Joal) [17:47:54] About docs for the webrequest, I go and add the page to the Data Stream category, and move it back to Analytics roots ? [17:48:01] Ironholds: Woah. There's a user with ~ 10k VisualEditor edits on one wiki? Wow. [17:48:05] naw, i think we can keep Data/ [17:48:14] i already added to the category [17:48:17] so, you can just edit it [17:48:21] i moved the other pages into Data/ too [17:48:29] Riiiight [17:48:40] I need to F5 with you ;) [17:51:57] Anyone know what the max string length for a JSON value in EventLogging data is? Looks like some events are being discarded because the text I added is too long. [17:53:58] bearND: there's no hard limit, but certain browsers and proxy servers impose a limit on URL length [17:54:33] we append a ';' to the URL in EventLogging's JavaScript code as a way of detecting whether the URL was truncated [17:55:07] (since any organic semicolons in the data are encoded as "%3B") [17:55:23] bearND: what's the schema name? I can grep the look [17:55:25] *grep the log [17:56:45] (PS2) Joal: Add client ip and geocoded data to refined webrequest table. [analytics/refinery] - https://gerrit.wikimedia.org/r/192363 [17:59:44] (CR) Ottomata: [C: 2 V: 2] Add client ip and geocoded data to refined webrequest table. [analytics/refinery] - https://gerrit.wikimedia.org/r/192363 (owner: Joal) [18:00:32] Thx for the quick review ottomata [18:00:44] Analytics-Kanban, MediaWiki-General-or-Unknown, JavaScript: mw.user.generateRandomSessionId not so random - https://phabricator.wikimedia.org/T78449#1070504 (kaldari) Has anyone upstreamed this bug to Apple? 
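The truncation check described at 17:54-17:55 (append a `;` to the URL; any semicolon inside the data is encoded as `%3B`, so a missing trailing `;` means the URL was cut off) can be sketched as follows. This is an illustration in Python rather than EventLogging's actual JavaScript or server code, and the function names and the example event fields are hypothetical.

```python
import json
from urllib.parse import quote, unquote


def encode_event(event):
    """Serialize an event as a percent-encoded query string ending in ';'."""
    # safe="" makes quote() encode every reserved character, including ';' as %3B.
    return quote(json.dumps(event), safe="") + ";"


def decode_event(query):
    """Return the event dict, or None when the trailing ';' is missing (truncated)."""
    if not query.endswith(";"):
        return None
    return json.loads(unquote(query[:-1]))


if __name__ == "__main__":
    query = encode_event({"action": "example", "text": "some; highlighted text"})
    print(decode_event(query))        # intact: the event comes back
    print(decode_event(query[:-5]))   # cut short: None
```

A client-side length cap keeps the encoded URL short enough that browsers and proxies are less likely to truncate it in the first place.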
[18:00:45] When do you want us to go through the deploy process ? [18:01:45] Thanks ori. Schema name is MobileWikiAppShareAFact. [18:04:20] joal ahhhhhHHHH [18:04:37] um, after plan east coast hackathon meeting today? [18:04:39] ottomata: 5 mins [18:04:47] if that is too late for you, we can do tomorrow [18:05:36] ori: This is still in testing and not released behyond our alpha bits and dev machines. The action sharetap as an associated text field, which has the text the user has highlighted that she wants to share. We've noticed that if the user has highlighted a lot of text the event doesn't make it all the way through the system. [18:06:14] ori: On the client side I can restrict the length to some number. I'm just trying to find a reasonable max length. [18:08:22] thing is, tomorrow I'll be on the road, therefore not online [18:09:49] Let's do that after the meeting [18:09:54] I'll pack now then :) [18:10:05] Back in an hour or so [18:10:31] ok, monday is fine too joal [18:10:38] As you prefer [18:10:39] might not be bad, seeing as we are already having some job lissues [18:10:45] oh but hackathon [18:10:47] meh, sure! [18:10:49] ok, let's wait :) [18:10:50] heh [18:10:51] k [18:10:55] mouarf ... [18:11:00] As you wish [18:11:07] We can go for it this evening [18:11:15] ok, if are both online and want to lets [18:11:22] but, i'm a little worried about some of the eixsting data... [18:11:27] i migth want to try to reload some stuff from kafka [18:11:41] Yo, read your message for qchris [18:12:06] It"ll be great to hang around sometime next week for you to explain me the various jobs etc [18:12:56] eh? [18:12:58] yes [18:14:54] bearND: Doesn't sound like the maxlength has to be the maximun EL can stand right? (the smaller payload you sent the better) [18:15:35] bearND: How do you plan to use that text once you have it stored? (free flow text is very hard to analyze) [18:16:22] nuria: Deskana is planning to analyze it. 
So he would be better suited to answer this question. [18:16:58] bearND: k, [18:19:44] Analytics, Mobile-Apps, Scrum-of-Scrums, Wikipedia-App-Android-App, and 4 others: Avoid cache fragmenting URLs for Share a Fact shares - https://phabricator.wikimedia.org/T90606#1063059 (KLans_WMF) [18:20:02] nuria: Deskana: I'll use an arbitrary length of 99 characters for the text until we come up with a better value. [18:20:45] bearND: k [18:21:07] ori: Wow. Thanks for your comments. I had not realized that the EL data is all sent as a query parameter. [18:34:28] Analytics-Kanban, MediaWiki-General-or-Unknown, JavaScript: mw.user.generateRandomSessionId not so random - https://phabricator.wikimedia.org/T78449#1070667 (Nuria) I believe this is a known issue and not per se a bug, Math.random is not supposed to be cryptographically strong, one of many reports on t... [18:49:18] ottomata: The sequence stats need recomputation for some hours yesterday and today. [18:49:29] ok [18:49:32] I tried to first get all the data onto hdfs. [18:49:35] was goign to try to relaunch those too [18:49:35] ok cool [18:49:37] :) [18:49:41] Then do the refining. [18:50:05] Once those processes caught up, I'll do the sequence stat recomputation. [18:50:22] But if you look at the size of the directories, you see that there is no missing data. [18:50:36] ok good, was hoping that was the case [18:50:55] And the timestamp reveal when the sequence stat computation got kicked off and when the last log line was written. [18:51:07] Note that refining only happened today european afternoon. [18:51:11] (on purpose) [18:51:28] So only the hours got refined that were fully in hdfs. [18:52:20] aye [19:03:17] ok, [19:03:26] so did you see Eloquence's email? [19:03:29] Nope. [19:03:32] Is it on a list or direct? [19:03:36] direct [19:03:42] Okay, let me read it. 
[19:03:47] called "Guided Tour activations" [19:04:10] he's basically asking if we have data about the guided tour "funnel" [19:06:17] Read it, checking now [19:07:41] milimetric, Ready to talk VE data? [19:07:54] halfak: yes, batcave [19:08:13] milimetric, yeah, so we have data for all of it. [19:08:45] superm401: awesome, thank you very much, I've gotta talk to Aaron now but if you could reply with any specifics, that'd be great [19:08:48] then I can get crunching [19:08:50] milimetric, sure. [19:11:28] (PS7) Mforns: [WIP] [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/192319 (https://phabricator.wikimedia.org/T89251) [19:11:35] (CR) jenkins-bot: [V: -1] [WIP] [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/192319 (https://phabricator.wikimedia.org/T89251) (owner: Mforns) [19:22:13] milimetric, sent you all the nitty-gritty, let me know if there are any questions. [19:26:02] Analytics-Kanban, Analytics-Wikimetrics: Story: AnalyticsEng has editor_day table in labsdb - https://phabricator.wikimedia.org/T71145#1070950 (kevinator) p:High>Low [19:31:22] kevinator: meeting? [19:31:41] sure [19:31:57] (the planning hackathon one) [19:31:57] on my way\ [19:56:37] so joal, you want to do deploy ment now? [19:56:39] or monday? [19:56:44] As you prefer :) [19:57:06] We can go for it now, depends for how long there is [19:58:12] it shoudlnt' take long [19:58:18] Let's go for it [19:58:20] you don't h'ave deployment rights, but i can show you that part [19:58:21] K! [19:58:23] batcave! [20:05:16] Analytics-Tech-community-metrics, ECT-February-2015: Remove the filter for key Wikimedia software projects in korma.wmflabs.org - https://phabricator.wikimedia.org/T86154#1071127 (Acs) Open>Resolved Filters removed. You can check the new repositories list for git and gerrit in: http://korma.wmflabs... [20:05:33] ottomata: Hola. Yt? 
[20:16:09] (PS2) Milimetric: Analyze edit success rate by user type [analytics/limn-edit-data] - https://gerrit.wikimedia.org/r/192944 (https://phabricator.wikimedia.org/T89729) [20:20:37] nuria / mforns / joal: much better I think: https://edit-analysis.wmflabs.org/adhoc.html#ve-success-rate-by-user-type.tsv [20:20:40] kevinator: ^ [20:20:59] that's the actual rate by user type, normalized by the number of sessions for each user [20:21:04] *for each user type [20:21:22] milimetric, awesome! [20:21:34] now it makes complete sense! [20:21:35] I'll add you to the code review, and given your blessing I'll present the data as I think it's actionable [20:21:42] or at least gives an idea of where to look next [20:22:29] milimetric: nice, did you talked with aaron about the normalization? [20:22:40] this chart is awesome :] [20:22:55] nuria: yes, of course, I'm useless by myself :) [20:22:57] milimetric, ok [20:23:14] also - he thought it's not worthwhile right now to dive deeper into more sophisticated classes of users [20:23:19] but he had an Awesome tip: [20:23:50] milimetric: excellent, i think we should round to 2 digits to have less visual clutter like 0.xx [20:23:53] when discussing these results, look out for what conclusions the audience (james, etc.) seem to disagree or have mixed opinions about. 
And dive deeper into those disagreements or confusions [20:24:08] nice, he's a pro [20:24:24] ah, yes, good point about the rounding [20:24:40] i know, I love going to school while I work [20:25:03] ask aaron if, by their standard, 0.655 goes to 0.66 or 0.65 [20:25:52] milimetric: they probably have a convention, i have used 0.65 in the past [20:28:43] (PS3) Milimetric: Analyze edit success rate by user type [analytics/limn-edit-data] - https://gerrit.wikimedia.org/r/192944 (https://phabricator.wikimedia.org/T89729) [20:28:44] milimetric: then the normalization, so i understand it, did you divide / [20:29:07] nuria: yes [20:29:26] before it was / [20:29:30] which - silly :) [20:29:50] milimetric: actually this 2-tier review system has worked real well [20:31:04] milimetric: the data makes sense intuitively now [20:31:17] yeah [20:32:09] agreed, thanks very much to both of the tiers :) [20:32:18] ottomata: hola? [20:32:30] also - i rounded to 2 nuria, it's up if you refresh [20:34:11] milimetric: I think that is nicer for a quick grasp (check out the dates, they are coming up as NaN now) [20:34:47] i just saw that, fixed [20:34:52] thank you milimetric, I find this chart really insightful, even if it confirms the intuitive [20:35:05] rounding dates to 2 decimal points apparently doesn't work :) [20:38:00] (PS4) Milimetric: Analyze edit success rate by user type [analytics/limn-edit-data] - https://gerrit.wikimedia.org/r/192944 (https://phabricator.wikimedia.org/T89729) [20:39:28] ottomata: I guess killing of the refinery job was you (on purpose during the deploy)? [20:39:48] yes [20:39:54] joal and I are deploying the new one [20:39:54] cool. thanks. [20:40:15] something is weird, but ja [20:50:29] ottomata: sorry I missed that last line ... something is weird? What is weird [20:50:31] ? [20:55:00] ottomata, joal: Did you test the new schema on old data?
[20:55:04] It breaks for me: [20:55:11] hive -e "select * from wmf.webrequest where webrequest_source='bits' and year=2015 and month=2 and day=26 and hour=17 limit 3;" [20:55:30] ^ on stat1002 gives "java.io.IOException:java.lang.IllegalStateException: Column record_version at index 20 does not exist in [...]" [20:55:41] On new data (hour=18) it works. [20:56:33] joal, ottomata ^ [20:56:56] yes [20:57:01] you can't do select * anymore [20:57:06] What? [20:57:14] but we have other problems, new fields are being inserted with NULL [20:57:16] trying to figure out why [20:57:23] we think because the create table used the old deprecated format, so hive is funky [20:57:29] testing and trying things now [20:57:30] :-((( [20:57:33] qchris, it's ok though. [20:57:38] you just can't select * on old data [20:57:50] you have to specify fields [20:57:58] the schema has the new fields in it [20:58:05] but the old parquet files do not have that in their schema [20:58:49] the table should still work at the moment, just no select * on old data [20:58:51] I think that's pretty bad. But don't worry about that right now. You said you focus on other issues right now. Sorry. Did not want to disturb you. [20:59:40] haha, it's better than having to recompute all the data! [20:59:40] :) [20:59:45] who needs select * anyway :p [21:00:17] :-D [21:21:16] Analytics-Kanban: Adhoc Analysis: Guided Tour activations - https://phabricator.wikimedia.org/T90942#1071435 (Milimetric) NEW a:Milimetric [21:41:14] milimetric: just saw the new graph. wow, it’s amazing how much easier it is to grasp. [21:41:36] apologies again for the initial confusion [21:47:33] hey milimetric [21:47:42] New graph is really cool ! [21:48:20] :) [21:48:47] the more experienced, the fewer errors [21:48:49] :) [22:02:44] ottomata: all the needed jobs on the cluster caught up. [22:02:51] wp-zero jobs are enabled again. [22:03:02] ew-ulcyn's backfilling finished. [22:03:15] Is it ok to announce that the cluster is usable again?
[22:03:25] awesooome [22:03:28] (or are there still issues with the schema change on webrequest table?) [22:03:29] we are checking on the refine jobs [22:03:33] i think we should be good [22:03:39] we are waiting for one to finish to double check [22:03:50] two already finished. [22:04:01] ? [22:04:08] OH [22:04:10] just finished [22:04:16] 0009812-150220163729023-oozie-oozi-C@1 [22:04:19] 0009812-150220163729023-oozie-oozi-C@2 [22:04:25] yup. [22:04:36] WOO [22:04:37] it works [22:04:43] yay! [22:04:45] Awesome! [22:05:17] So can I leave the "cluster is ok again to use" email to you, so you can also announce the new columns? [22:05:27] (And explain the 'select *' thing?) [22:05:40] yes! [22:05:44] You rock! [22:05:48] Thanks. [22:05:53] but I HAVE TO RUN NOWWWW i am always late for stuffffffff i have to get to this back before it closes. [22:05:57] AHH [22:05:59] i can do it real fast. [22:08:45] email sent. [22:08:56] Aaaaaaaaaaawesome! [22:09:11] Thanks. [22:34:41] Analytics-Dashiki, Analytics-Kanban, Patch-For-Review: Pageviews not loading in Vital Signs - https://phabricator.wikimedia.org/T90742#1071869 (ggellerman) a:Nuria [22:40:51] Analytics-Cluster, Analytics-Kanban: WMF has technical documentation on UC by last visited date [5 pts] {bear} - https://phabricator.wikimedia.org/T88812#1071894 (kevinator)
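(Editor's note on the rounding question at 20:25, whether 0.655 goes to 0.66 or 0.65: the log does not say which convention was finally used for the chart. The sketch below only illustrates how the answer depends on the tie-breaking rule. The helper name round_rate is hypothetical, and Python's standard decimal module is an assumption here, not the tooling used in limn-edit-data.)

```python
from decimal import Decimal, ROUND_HALF_DOWN, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_rate(value: str, rounding: str = ROUND_HALF_UP) -> Decimal:
    """Round a rate to two decimal places with an explicit tie-breaking rule.

    The value is passed as a string so that 0.655 is represented exactly;
    a binary float cannot hold 0.655 exactly, which would hide the tie.
    (round_rate is a hypothetical helper, not part of any project in the log.)
    """
    return Decimal(value).quantize(Decimal("0.01"), rounding=rounding)


# The same input under three common tie-breaking conventions.
half_up = round_rate("0.655", ROUND_HALF_UP)      # ties go away from zero
half_down = round_rate("0.655", ROUND_HALF_DOWN)  # ties go toward zero
half_even = round_rate("0.655", ROUND_HALF_EVEN)  # ties go to the even neighbour

print(half_up, half_down, half_even)
```

(Under these assumptions, half-up and half-even both give 0.66, while half-down gives the 0.65 mentioned at 20:25:52. Whichever rule a team picks, stating it explicitly keeps charts comparable across reports.)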