[00:00:23] qchris, there you go, merged one of your patches! [00:00:36] ? [00:00:48] oh ua-parser? [00:01:00] mhmm ... thanks. [01:02:23] (CR) Nuria: [C: 2] Show namespace field description next to the field and remove it from placeholder [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/168202 (https://bugzilla.wikimedia.org/71582) (owner: Bmansurov) [01:02:30] (Merged) jenkins-bot: Show namespace field description next to the field and remove it from placeholder [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/168202 (https://bugzilla.wikimedia.org/71582) (owner: Bmansurov) [01:05:53] Analytics / Wikimetrics: Story: WikimetricsUser has better explanation counting on all namespaces - https://bugzilla.wikimedia.org/71582 (nuria) PATC>RESO/FIX [04:59:53] (CR) Springle: [WIP] Add schema for edit fact table (2 comments) [analytics/data-warehouse] - https://gerrit.wikimedia.org/r/167839 (owner: QChris) [05:03:37] (CR) Springle: [WIP] Add schema for edit fact table (1 comment) [analytics/data-warehouse] - https://gerrit.wikimedia.org/r/167839 (owner: QChris) [12:25:54] (PS2) QChris: Bump version number after release [analytics/refinery/source] - https://gerrit.wikimedia.org/r/168779 [12:25:55] (PS1) QChris: [WIP] Add Hive UDF for media url identification [analytics/refinery/source] - https://gerrit.wikimedia.org/r/169346 [12:29:28] (CR) QChris: [C: -2] "Just drafting" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/169346 (owner: QChris) [13:30:40] Analytics / Wikimetrics: User validation throws exception when accessing an unknown project - https://bugzilla.wikimedia.org/72582 (Marcel Ruiz Forns) NEW>ASSI a:Marcel Ruiz Forns [13:43:05] (CR) Ottomata: [C: 2 V: 2] Bump version number after release [analytics/refinery/source] - https://gerrit.wikimedia.org/r/168779 (owner: QChris) [13:43:11] \o/ [13:44:55] refinery trainer?!!?! :O [13:44:56] :) [13:45:05] We need to detect if things change. [13:45:14] So we can cron that trainer. [13:46:03] And ... the approach is reusable for other UDFs too ;-) [13:49:40] you will have to train me about training, but cool! [13:52:38] I think the task for the trainer should play eye of the tiger as it goes. Add that to the requirements. [13:52:51] oh, also! Weightlifting montage. You can't have Eye of the Tiger and no montage, that's just silly. [13:53:27] :-P [13:53:41] ottomata: Run [13:53:42] for I in $(seq 20141001 20141028) ; do echo $(date) $I ; DATE=$I /home/qchris/refinery-source/trainer/IdentifyMediaFileUrl/train.sh --skip-rebuild ; done [13:53:45] on stat1002. [13:54:04] That will check the sampled-1000 logs for october for URLs that the UDF cannot fully parse. [13:54:19] ottomata, <3 [13:54:26] :) [13:55:58] ottomata, how long until and I expect to be in the research group? [13:56:10] * halfak would like to update his .my.cnf ASAP [13:56:19] ? dont' you already have the password? [13:56:38] Oh. I do. I just want to make a simlink to the protected password file. [13:56:44] *symlink [13:57:23] ah, the pw file on stat1003 is not in my.cnf format...yet [13:57:25] i want to change that [13:57:32] Ahh! [13:57:36] Yes please :) [13:57:41] but how to properly change that is in discussion here [13:57:41] https://gerrit.wikimedia.org/r/#/c/168993/ [14:01:59] hm i'm having trouble joining the hangout, it doesn't show anyone else is there and is just "ringing". 
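[Editor's sketch] The trainer loop qchris pastes at 13:53 checks the sampled-1000 logs for URLs the media-URL UDF cannot parse, and the plan at 13:45 is to "detect if things change" and "cron that trainer". A minimal sketch of what that cron wrapper could look like is below. It assumes train.sh prints the unparseable URLs to stdout and that the state directory is chosen by the user — neither of those details is confirmed in the conversation.

    #!/bin/bash
    # Sketch: run the IdentifyMediaFileUrl trainer for yesterday and flag changes.
    # Assumes train.sh writes the URLs the UDF cannot fully parse to stdout
    # (not confirmed above) and that this runs daily from cron on stat1002.

    TRAINER=/home/qchris/refinery-source/trainer/IdentifyMediaFileUrl/train.sh
    STATE_DIR="$HOME/trainer-state"   # hypothetical location for previous results
    mkdir -p "$STATE_DIR"

    DAY=$(date --date=yesterday +%Y%m%d)
    CURRENT="$STATE_DIR/unparsed-$DAY.txt"
    # Most recent previous result (empty on the first run, or when re-run same day).
    PREVIOUS=$(ls "$STATE_DIR"/unparsed-*.txt 2>/dev/null | sort | tail -n 1)

    DATE=$DAY "$TRAINER" --skip-rebuild | sort -u > "$CURRENT"

    # Only shout (exit non-zero, so cron mails the output) if the set of
    # unparseable URLs changed since the previous run.
    if [ -n "$PREVIOUS" ] && [ "$PREVIOUS" != "$CURRENT" ] && \
       ! diff -q "$PREVIOUS" "$CURRENT" >/dev/null; then
        echo "Unparseable media URLs changed between $PREVIOUS and $CURRENT:"
        diff "$PREVIOUS" "$CURRENT"
        exit 1
    fi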
[14:02:20] https://plus.google.com/hangouts/_/calendar/d2lraW1lZGlhLm9yZ19jYjM3bXU0OGNuaHRkN2hybmE4czI3b25hb0Bncm91cC5jYWxlbmRhci5nb29nbGUuY29t.c6j7qidqs491nhi7ovk9pi4h14 [14:02:53] halfak: you are in that group on stat1003 now [14:03:06] Thanks ottomata. [14:03:40] * halfak imagines a cron job that would copy the password into a .my.cnf [14:04:22] hm thanks ottomata. now i'm joined but it seems like it's not sending video, hrm [14:33:24] technical difficulties [14:34:56] milimetric: nuria__ mforns qchris_meeting the researchers are in the hangout specified in the meeting [14:35:00] not the batcave [14:35:51] kevinator: I did't get the invitation to the meeting [14:36:00] I will forward [14:36:52] thanks@ [14:36:54] ! [14:39:23] milimetric, can we take 30m today to chat about the cubes and how to port info there? I have some datasets I'm planning to generate that'd be best punted there [14:41:39] Ironholds: of course [14:42:00] but Ironholds: the answer is: a new data warehouse that Sean and I are currently designing [14:42:19] cool! But in the meantime I have a daily-updated TSV that apps wants. Where do I put it? ;p [14:43:03] there's a cronjob running that copies things from a specific places in stat1002 to public places [14:43:06] so you can put it there... [14:43:13] totally! [14:43:14] similar to what exists in stat1003 [14:43:20] assuming this is public info [14:43:20] yes! I know! [14:43:30] and then I...ask apps product managers to make graphics with their minds? ;p [14:43:36] Limn [14:43:48] apps already uses limn [14:43:49] you rang? [14:43:52] oh :) [14:44:24] put out *SV, graphs in limnm [14:44:32] YuviPanda: Ironholds is talking about this thing we've been playing with: http://pentaho.wmflabs.org/ [14:44:39] oh [14:44:56] (click "Login as an Evaluator") [14:45:05] I was just responding to > cool! But in the meantime I have a daily-updated TSV that apps wants. Where do I put it? ;p [14:45:10] will do, milimetric! [14:45:18] and then Create New -> New Saiku Analysis [14:45:29] then select the only available cube and play around, it's kind of fun [14:45:47] the idea is to make a general warehouse that this kind of cube to be built on top of [14:45:54] and start exposing data, including hopefully event logging [14:46:16] YuviPanda, yes, but Limn is hinky, ugly, not being supported in the long-term and not suited for this sort of data. [14:47:25] Ironholds: I'm sure the mobile app team PMs are ok with waiting for a non hinky one :) [14:47:36] indeed! [14:47:44] which is why I'm asking about how I put data in a non-hinky one ;p [14:47:51] ah, then all cool [14:47:58] I should stop barging into conversations mid way [14:48:10] Ironholds: nothing's ready yet sadly, so Limn for now [14:48:23] milimetric, darnit. But cubes! :( [14:48:32] okay. In that case I'll puzzle out limn [14:48:38] (/public-datasets/, right?) [14:48:49] Ironholds: I can help with that (and yes, public-datasets [14:48:50] ) [14:48:55] yay! [15:01:11] ottomata, i have done some zero log normalization and placed them into /a/zerologs/w2h-cache on stat1002. What do you think would be the best way to import them - should i merge it all into one file, sort | uniq them, and upload them while also breaking into partaitions? 
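[Editor's sketch] For Ironholds' "where do I put a daily TSV?" question, milimetric's answer is that an existing cron job copies files from a specific place on stat1002 to a public location, similar to /public-datasets/ on stat1003. A rough sketch of the producing side is below; the drop directory and the report-generating command are placeholders, since the log only says "a specific place" — the real path should be confirmed with the Analytics team.

    #!/bin/bash
    # Sketch: write a daily TSV where the existing rsync/cron job will pick it
    # up and publish it. PUBLIC_DROP and generate_app_report are hypothetical.

    PUBLIC_DROP=/a/public-datasets/mobile-apps     # placeholder path
    OUT="$PUBLIC_DROP/app-pageviews-$(date +%Y-%m-%d).tsv"

    mkdir -p "$PUBLIC_DROP"
    generate_app_report > "$OUT.tmp" && mv "$OUT.tmp" "$OUT"
    chmod 0644 "$OUT"   # must be world-readable to be served publicly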
[15:01:32] bwerrrrr [15:03:56] hm, yurikR, you could hdfs dfs -put them into partition directories in hdfs somehwere [15:04:00] and then map an external table on them [15:04:18] ottomata, but they are not cleanly broken along partition lines [15:04:19] or, you could hdfs -put them them all into a single temp directory somewhere in hdfs [15:04:42] and then make a temp external table that knows how to look at the fields and timestamps [15:04:56] and select ... from insert into a new table [15:05:03] with partition values set from by the record timestamp [15:05:31] could could do that in several queries, using where timestamp BETWEEN [15:05:38] or, i think you can teach have to do dynamic field based partition mapping [15:05:43] i have not tried to do this [15:05:46] but you should be able to [15:05:52] the files structure is already identical to webrequest, with proper partition fields [15:06:11] its just that in the same file you could have multiple partitions [15:06:19] due to logrotate? [15:06:20] and also same partition in multiple files [15:07:05] i was converting one->one file from original. Yes, rotation, and also we used to break them up by carrier [15:08:31] yurikR: https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-Dynamic-PartitionInsert [15:08:44] so you think i should just dump all these files into the same hadoop db that has no partitioning? [15:09:55] what are you planning on doing with these? [15:10:02] are you going to update these with new data as it comes in? [15:10:45] i plan to merge this data with the realtime webrequest data and generate one historical view [15:10:52] ah right [15:11:11] without actually merging of course [15:11:45] yurikR: i would: [15:11:45] hdfs dfs -put this data into a single temp directory, [15:11:45] map a temp hive external table on top of it [15:11:45] do a select * insert into from that table into your main table, using dynamic parittion inserts to infer the proper partitions from each recrod [15:11:52] ". In the dynamic partition insert, the input column values are evaluated to determine which partition this row should be inserted into." [15:12:32] then, once you have that done, you can delete the temp external table and the temp hdfs data dir [15:13:04] do i need to pre-sort each file? [15:14:03] naw, don't thikn so [15:14:18] i haven't done this before, though [15:14:21] but i think it shoudl work [15:14:37] ok, so i need to upload before mapping? [15:19:56] ? [15:20:03] oh, no doesn't matter [15:20:11] you can put hte data first, or create the external table first, doesn't matter [15:20:16] but, you'd need to hdfs put the data [15:20:21] before select insert into [15:20:40] oki, it just finished with the last file, uploading... [15:38:03] going to have lunch, I'll be back in 1 hour, bye [16:32:37] hey, do you guys know if search engine informational boxes result in a 1:1 counted pageview for api hits? also, any idea if the android and ios voice search providers make live hits against the api? or is it more likely that this stuff is cached at intermediaries? [16:38:21] milimetric, nuria__: hi! can you explain to me how sprint demo works? [16:39:53] mforns: sure [16:40:00] so kevin runs the meeting [16:40:04] ok [16:40:05] tells the summary of the sprint [16:40:12] and gets to the showcase [16:40:16] anything we have to show - we talk about [16:40:26] so you would share your screen and go over the same stuff you went over yesterday with me [16:40:41] ok [16:40:53] who's attending? 
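[Editor's sketch] The import workflow ottomata outlines at 15:11 for yurikR's historical zero logs — put everything into one temporary HDFS directory, map a temporary external table over it, then use a dynamic-partition insert so Hive derives each row's partition from its timestamp — could look roughly like the sketch below. Table and column names are illustrative, not the real webrequest schema, and the partition-value extraction depends on the actual timestamp format; see the Hive dynamic-partition tutorial linked above.

    # 1. Put all the normalized zero log files into one temporary HDFS directory.
    hdfs dfs -mkdir -p /tmp/yurik/zerologs_staging
    hdfs dfs -put /a/zerologs/w2h-cache/* /tmp/yurik/zerologs_staging/

    # 2. Map a temporary external table over that directory, then let Hive
    #    route each row into the right partition of the target table.
    hive -e "
    CREATE EXTERNAL TABLE tmp_zerologs (
      hostname  STRING,
      dt        STRING,
      uri       STRING
      -- ... remaining webrequest fields ...
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/tmp/yurik/zerologs_staging';

    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;

    INSERT INTO TABLE zero_webrequest_history
    PARTITION (year, month, day, hour)
    SELECT
      hostname, dt, uri,
      -- derive partition columns from each record's timestamp; adjust the
      -- offsets to the real timestamp format (e.g. 2014-10-27T15:00:00)
      substr(dt, 1, 4)   AS year,
      substr(dt, 6, 2)   AS month,
      substr(dt, 9, 2)   AS day,
      substr(dt, 12, 2)  AS hour
    FROM tmp_zerologs;

    DROP TABLE tmp_zerologs;
    "

    # 3. Once the insert is verified, the staging data can be removed:
    # hdfs dfs -rm -r /tmp/yurik/zerologs_staging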
[16:41:09] basically - upload a cohort, show that it's been expanded. You can run a report on the cohort to show the individual results so people can see the expansion [16:41:18] ok [16:41:36] everyone's invited, but attendees are usually grantmaking folks, sometimes members of other teams [16:41:47] whoever has half an hour to burn or some stake in what we're doing. [16:41:56] fine [16:42:13] one other question [16:42:36] I modified in staging the changes for the 2 bugs we encountered yesterday [16:42:59] the newline stuff and the unaccessible project database [16:43:48] because the newline bug is not merged yet, and for the other I still have to write the test [16:44:06] is it ok to leave them in staging? [16:44:10] yeah, that's fine [16:44:26] and the last thing [16:44:28] those thing can more than likely get merged by Thursday's deployment [16:44:49] when we deployed to stating, I remember we did a cherry-pick for puppet repo [16:44:54] but not for wikimetrics repo [16:44:54] yes [16:45:20] for puppet there are two repos - operations/puppet and operations/puppet/wikimetrics [16:45:26] I'm not sure, but didn't we checkout wikimetrics to another head? [16:45:38] the cherry pick is only necessary on operations/puppet because we have local commits there [16:45:44] ok [16:46:08] the large CSV file changeset is wikimetrics code? [16:46:41] it was a migration, right? [16:48:10] it has a migration yes, an index [16:49:08] nuria__: are you going to demo that? Like - here's a large CSV download, look how fast? [16:49:44] I’m watching… what do you want to demo :-) [16:50:33] :] [16:52:13] all the stories? [16:53:02] download larve CSV, new edits/pages metrocs, bots filtered out of metrics and finally centrau auth insertions [16:53:36] jgage: are you prepared to demo logstash too? [17:03:16] kevinator: last I heard jeff was not yet ready. No problem, we have a couple of things to show regardless [17:03:45] nuria can also talk about the difference in our metrics with the bot filter [17:03:50] she has some graphs [17:04:06] yeah… and I don’t think we ever talked about the new bookmarkable URL either... [17:07:46] OK, I completed the deck. [17:08:47] nuria__ mforns feel free to edit the deck. I put in pages for each of you on what I think you will showcase https://docs.google.com/presentation/d/1Phslf7NZvnaAThrt5B39U6q38mQ4lhWUuOnrcvD53AY/edit?usp=sharing [17:09:41] milimetric: can we mark this story as fixed? https://bugzilla.wikimedia.org/show_bug.cgi?id=66843 [17:14:34] kevinator: done [17:14:37] Analytics / Wikimetrics: Story: User creates cohort with CentralAuth insertions - https://bugzilla.wikimedia.org/66843 (Dan Andreescu) ASSI>RESO/FIX [17:14:47] Thanks! [17:16:06] kevinator: can you give me edit permits on the doc, please? [17:16:43] you have edit permission. Perhaps you’re viewing it as a different user? [17:18:01] Google show only one viewer on the doc right now and it’s “Anonymous Wolverine” [17:19:18] oh yeah, I'm on it with my other account, but it has nothing to do with Wolverine xDDDDDDDDDDDD [17:19:57] thanks [17:20:02] you’re welcome [17:20:58] Google seems to assign random animals to people who can view a public doc. [17:22:26] I’m gonna get a bagel before the next set of meetings. 
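[Editor's sketch] On mforns' question about why the staging deployment needed a cherry-pick for operations/puppet but not for the wikimetrics repo (milimetric: only the puppet checkout carries local commits), the mechanics boil down to something like the following. Checkout locations and the refs/changes path are placeholders — copy the real ref from the change's "Download" box in Gerrit.

    # On the staging puppetmaster: fetch the pending change from Gerrit and
    # re-apply it on top of the locally patched branch.
    cd /var/lib/git/operations/puppet     # hypothetical checkout location
    git fetch https://gerrit.wikimedia.org/r/operations/puppet refs/changes/NN/NNNNNN/N
    git cherry-pick FETCH_HEAD

    # The wikimetrics checkout has no local commits on staging, so a plain
    # fast-forward update is enough there:
    cd /srv/wikimetrics                    # hypothetical checkout location
    git pull --ff-only origin master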
[17:22:28] BRB [17:39:26] kevinator: will add couple plots about bots [17:39:41] thanks [17:46:54] Analytics / General/Unknown: Make sure 2013 traffic logs are gone from /a/squids/archive on stat1002 - https://bugzilla.wikimedia.org/63543#c7 (christian) It seems RT 8760 (which is about auditing log retention on stat1002) might be relevant here. [18:01:24] FYI the room for the Showcase isn’t free yet… I’m waiting outside. [18:04:36] kevinator: thanks [18:07:09] Analytics / Wikimetrics: report table performance, cleanup, and number of items - https://bugzilla.wikimedia.org/72635 (Dan Andreescu) NEW p:Unprio s:normal a:None A Central Auth cohort creates many rows, because with the current implementation, one MetricReport node is made for each projec... [18:28:51] that was pretty cool. thanks mforns for completing centralauth stuff. hope my code wasn't hard to navigate. :) [18:29:44] terrrydactyl: thank YOU, the task was more than 50% done, so it was a lot more easy to catch up :] [18:30:53] heh, what are you going to be working on next? [18:33:23] in wikimetrics we'll continue with cohort tags [18:33:52] I mean, filtering by tag [18:36:33] like so other people can search for it or just within a user? [18:36:47] curious what analytics decided to do with tagging. [19:46:10] Analytics / EventLogging: Story: Identify and direct the purging of Event logging raw logs older than 90 days in stat1002 - https://bugzilla.wikimedia.org/72642 (nuria) NEW p:Unprio s:normal a:None Story: Identify and direct the purging of Event logging raw logs older than 90 days in stat... [19:46:52] Analytics / EventLogging: Story: Identify and direct the purging of Event logging raw logs older than 90 days in stat1002 - https://bugzilla.wikimedia.org/72642#c1 (nuria) This is tracket as part of RT ticket: #8760 [19:54:02] !log Ran kafka leader re-elections, since analytics1021 got kicked out of being partition leader (See {{bug|72550}}) [19:56:23] Analytics / Refinery: analytics1021 getting kiked out of kafka partition leader role on 2014-10-27 ~07:12 - https://bugzilla.wikimedia.org/72550#c1 (christian) I ran a leader re-election. Analytics1021 is leader for a few partitions again. (Still pending on check whether leader re-election caused loss... [20:10:25] Analytics / EventLogging: Add test flag to EventLogging - https://bugzilla.wikimedia.org/72365#c2 (christian) (In reply to nuria from comment #1) > This is due to the fact that EL does not have a server testing environment > in which you can try your data setup. Vagrant can run EventLogging. (At leas... [20:12:00] (CR) Nuria: "Fix looks fine although i could not properly tested as I have no windows vm or machine around." [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/169223 (https://bugzilla.wikimedia.org/72581) (owner: Mforns) [20:12:43] (CR) Nuria: [C: 2] "Fix looks fine although I could not properly test it as I do not have a file created on a Windows machine." 
[analytics/wikimetrics] - https://gerrit.wikimedia.org/r/169223 (https://bugzilla.wikimedia.org/72581) (owner: Mforns) [20:12:51] (Merged) jenkins-bot: Normalize windows line endings in cohort csvs [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/169223 (https://bugzilla.wikimedia.org/72581) (owner: Mforns) [20:13:09] nuria: you can use the file that I'll pass to you now [20:16:24] Analytics / Refinery: Raw webrequest partitions that were not marked successful - https://bugzilla.wikimedia.org/70085 (christian) [20:16:25] Analytics / Refinery: Raw webrequest text partition for 2014-10-27T15/3H not marked successful - https://bugzilla.wikimedia.org/72644 (christian) NEW p:Unprio s:normal a:None The text webrequest partition [1] for 2014-10-27T15/3H has not been marked successful. What happened? [1] _______... [20:53:23] Analytics / EventLogging: Add test flag to EventLogging - https://bugzilla.wikimedia.org/72365#c3 (nuria) >Vagrant can run EventLogging. Can run the client code, yes. But it has not "storage" for events thus you do not known if they validate and whether they are going into the DB. And this, more often... [21:32:08] Analytics / Refinery: analytics1021 getting kicked out of kafka partition leader role on 2014-10-27 ~07:12 - https://bugzilla.wikimedia.org/72550#c2 (christian) This bug is still missing the numbers of lost messages when analytics1021 lost it's partition leader role. For the text cluster, it only affe... [21:34:03] !log Marked raw text webrequest partition for 2014-10-27T07/1H ok (See {{bug|72550}}) [21:34:09] !log Marked raw upload webrequest partition for 2014-10-27T07/1H ok (See {{bug|72550}}) [21:51:23] Analytics / Refinery: Raw webrequest text partition for 2014-10-27T15/3H not marked successful - https://bugzilla.wikimedia.org/72644#c1 (christian) NEW>RESO/FIX Only amssq42 was affected and showed resets in sequence numbers. Since amssq42 was depooled for trusty testing at that time, the resets... [21:52:50] !log Marked the 3 raw text webrequest partitions for 2014-10-27T15/3H ok (See {{bug|72644}}) [21:56:23] Analytics / Refinery: Raw webrequest partitions that were not marked successful due to configuration updates - https://bugzilla.wikimedia.org/72300 (christian) [21:56:25] Analytics / Refinery: Raw webrequest partitions that were not marked successful due to depooled servers interfering with monitoring - https://bugzilla.wikimedia.org/72649 (christian) NEW p:Unprio s:normal a:None In this bug, we track issues around raw webrequest partitions (not) being marke... 
[21:56:54] Analytics / Refinery: Raw webrequest text partition for 2014-10-27T15/3H not marked successful - https://bugzilla.wikimedia.org/72644 (christian) [21:57:08] Analytics / Refinery: Raw webrequest partitions that were not marked successful due to depooled servers interfering with monitoring - https://bugzilla.wikimedia.org/72649 (christian) [21:57:09] Analytics / Refinery: Raw webrequest partitions that were not marked successful - https://bugzilla.wikimedia.org/70085 (christian) [21:57:24] Analytics / Refinery: Raw webrequest partitions that were not marked successful - https://bugzilla.wikimedia.org/70085 (christian) [21:57:24] Analytics / Refinery: Raw webrequest bits partition for 2014-10-22T23/1H not marked successful - https://bugzilla.wikimedia.org/72428 (christian) [21:57:25] Analytics / Refinery: Raw webrequest partitions that were not marked successful due to depooled servers interfering with monitoring - https://bugzilla.wikimedia.org/72649 (christian) [21:58:01] wikibugs ... why you so spammy? [21:58:37] :} [22:26:23] Analytics / Refinery: Raw webrequest bits partition for 2014-10-26T21/1H not marked successful - https://bugzilla.wikimedia.org/72548#c3 (christian) ganglia again shows data for esams caches, but the data between ~2014-10-24T12 and ~2014-10-27T16 is missing (which contains the minute where we had cp301... [22:39:05] qchris, i'm getting all sorts of errors while running a big query. How do i set up a proxy to access all hadoop servers? [22:39:23] i am able to see the http://localhost:8888/cluster/app/application_1409078537822_56807 [22:39:32] but not the logs, etc [22:40:29] Previously, I just set up the tunnels for the machines I was interested in, but since hdfs got mounted to plain fs, you can just read through the logs on stat1002. [22:40:38] Let me grab the path ... [22:41:24] yurikR: You should find all information you need underneath /mnt/hdfs/var/log/hadoop-yarn/apps/yurik [22:41:28] (on stat1002) [22:41:40] (in plain fs, so you can grep, find, ...) [22:43:52] qchris, true, but would be nice to use the web interface for stuff [22:43:57] If you really want the logs in a web interface ... kraken had a script to set up all the forwards one needs. it is at [22:43:59] https://git.wikimedia.org/raw/analytics%2Fkraken.git/658a43dd27595e5b6a5dffe14fb4e5c3720d9026/bin%2Fktunnel [22:43:59] is there a way to proxy it all? [22:44:22] Analytics / Wikimetrics: report table performance, cleanup, and number of items - https://bugzilla.wikimedia.org/72635 (Kevin Leduc) p:Unprio>Highes [22:44:33] I guess that script is still working. [22:44:56] But ottomata uses another tool. Some intrusive network thing ... let me try to find the name ... [22:45:03] i saw that script, but a) its a linux only, and b) it doesn't look like it can share all of hadoop servers. What I think i need is a virtual net [22:46:08] e.g. i ssh into a VPN and DNS resolves hosts via vpn first [22:46:33] but thx for the /mnt path, looking... [22:48:47] There it is ... ottomata uses sshuttle. [22:49:25] Analytics / Refinery: Spike: Assess feasibility and effort to add fields to webrequest logs - https://bugzilla.wikimedia.org/72651 (Kevin Leduc) NEW p:Unprio s:normal a:None Quote from Trello: "Aaron, Oliver and [Dario] sat down to think of a number of ways in which we could facilitate pars... 
[22:49:43] https://github.com/apenwarr/sshuttle [22:50:34] alias wmvpn='sshuttle --dns -vvr otto@bast1001.wikimedia.org 10.0.0.0/8' [22:50:34] Hi qchris, can you look at the bug above ^^ The researchers want to know from you and Andrew if this if possible [22:50:56] yurikR: ^ (the sshuttle lines above) [22:51:37] qchris, thx. Btw, the log seems to be semi-binary: less /mnt/hdfs/var/log/hadoop-yarn/apps/yurik/logs/application_1409078537822_56841/analytics1017.eqiad.wmnet_8041 [22:51:37] kevinator: Is it urgent? (just wanted to log out, before yur ikR grabbed me) [22:51:54] * yurikR is good at that [22:51:59] no, not urgent… you can look at it tomorrow [22:52:16] and thanks for the links! [22:52:18] yurikR: It's not all 7-bit ascii, but less, and grep works nicely. [22:52:25] kevinator: k [22:53:24] Oh. I forgot to paste, where ottomata mentioned the above commands. It's from: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-analytics/20140725.txt [22:58:13] thx! [22:58:20] yw. [22:58:22] Good night!
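[Editor's sketch] As qchris notes, the aggregated YARN application logs are reachable through the fuse mount on stat1002, so grep and less work directly even though the files carry some binary framing. A small sketch using the application id from the discussion above; the `yarn logs` alternative assumes the YARN client is installed on the host, which is not confirmed in the log.

    # Aggregated container logs for one application, via the fuse mount:
    APP=application_1409078537822_56841
    LOGDIR=/mnt/hdfs/var/log/hadoop-yarn/apps/yurik/logs/$APP

    # grep straight through the (partly binary) log files for failures
    grep -a -iE 'error|exception' "$LOGDIR"/* | less

    # Alternative, if the yarn client is available: it strips the binary
    # container-log framing and prints plain text.
    yarn logs -applicationId "$APP" | less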