[00:03:00] (CR) Nuria: Bootstrapping from url. Keeping state. (1 comment) [analytics/dashiki] - https://gerrit.wikimedia.org/r/160685 (https://bugzilla.wikimedia.org/70887) (owner: Nuria) [00:13:05] (PS13) Nuria: Bootstrapping from url. Keeping state. [analytics/dashiki] - https://gerrit.wikimedia.org/r/160685 (https://bugzilla.wikimedia.org/70887) [00:14:18] kevinator: i totally forgot about deployment today too [00:14:37] I pinged dan earlier today [00:14:39] he did it [00:14:51] I was about to write an email to wikimetrics about it. [00:30:13] man, hadoop is FAST [00:30:34] SELECT project, count(*), group by project over a month of mobile and desktop requests, started an hour ago. Quarter of the way through mapping. [00:30:47] * Ironholds shakes head in amazement. [00:37:02] Analytics / Wikimetrics: Story: WikimetricsUser has better explanation counting on all namespaces - https://bugzilla.wikimedia.org/71582 (Kevin Leduc) NEW p:Unprio s:minor a:None This applies to edits and pages created metrics. You need to delete the text in the "namespaces" textfield to se... [00:38:00] Analytics / Wikimetrics: Story: WikimetricsUser has better explanation counting on all namespaces - https://bugzilla.wikimedia.org/71582#c1 (Kevin Leduc) Created attachment 16654 --> https://bugzilla.wikimedia.org/attachment.cgi?id=16654&action=edit pic of bug [01:44:01] (PS1) Yuvipanda: Update fab to send to new hosts [analytics/quarry/web] - https://gerrit.wikimedia.org/r/164501 [01:44:17] (CR) Yuvipanda: [C: 2] Update fab to send to new hosts [analytics/quarry/web] - https://gerrit.wikimedia.org/r/164501 (owner: Yuvipanda) [01:44:22] (Merged) jenkins-bot: Update fab to send to new hosts [analytics/quarry/web] - https://gerrit.wikimedia.org/r/164501 (owner: Yuvipanda) [09:12:23] (Abandoned) Milimetric: Improve survival metric query performance [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/121630 (https://bugzilla.wikimedia.org/63201) (owner: Milimetric) [09:13:46] (PS2) Milimetric: Cleaning up before doing anything in the build [analytics/dashiki] - https://gerrit.wikimedia.org/r/164254 (owner: Nuria) [09:14:00] (CR) Milimetric: [C: 2 V: 2] Cleaning up before doing anything in the build [analytics/dashiki] - https://gerrit.wikimedia.org/r/164254 (owner: Nuria) [09:15:47] (PS14) Milimetric: Bootstrapping from url. Keeping state. [analytics/dashiki] - https://gerrit.wikimedia.org/r/160685 (https://bugzilla.wikimedia.org/70887) (owner: Nuria) [09:15:56] (CR) Milimetric: [C: 2 V: 2] Bootstrapping from url. Keeping state. [analytics/dashiki] - https://gerrit.wikimedia.org/r/160685 (https://bugzilla.wikimedia.org/70887) (owner: Nuria) [09:16:05] (CR) Milimetric: Bootstrapping from url. Keeping state. (1 comment) [analytics/dashiki] - https://gerrit.wikimedia.org/r/160685 (https://bugzilla.wikimedia.org/70887) (owner: Nuria) [09:39:32] Analytics / Wikimetrics: slow report result and report list fetching - https://bugzilla.wikimedia.org/71603 (Dan Andreescu) NEW p:Unprio s:normal a:None It looks like we need a simple index on the report table, probably on the user_id column, to speed up two operations that have become slow... [11:30:02] Analytics / Wikimetrics: Apache's logs containing "client denied by server configuration: /srv/wikimetrics/wikimetrics/api.wsgi" - https://bugzilla.wikimedia.org/71606 (christian) NEW p:Unprio s:normal a:None Dan reported through email that we're seeing [Sun Sep 28 14:24:23 2014] [error]... 
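A minimal sketch of the kind of query Ironholds describes above ("SELECT project, count(*), group by project over a month of mobile and desktop requests"), assuming the raw requests live in a Hive table named webrequest with a project field and webrequest_source/year/month partitions; the table, column, and partition names here are assumptions, not confirmed in the log:

    -- Hypothetical sketch: per-project request counts over one month of
    -- mobile and desktop traffic. Table, column, and partition names are
    -- assumptions; 'project' may need deriving from uri_host instead.
    SELECT project, COUNT(*) AS requests
    FROM webrequest
    WHERE webrequest_source IN ('text', 'mobile')
      AND year = 2014 AND month = 9
    GROUP BY project;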
[11:47:45] Analytics / Wikimetrics: Apache's logs containing "client denied by server configuration: /srv/wikimetrics/wikimetrics/api.wsgi" - https://bugzilla.wikimedia.org/71606#c1 (christian) The first occurrence I could find in the logs has a timestamp of [Sat Sep 20 14:37:58 2014] ... almost two weeks bac... [11:51:00] Analytics / Wikimetrics: Apache's logs containing "client denied by server configuration: /srv/wikimetrics/wikimetrics/api.wsgi" - https://bugzilla.wikimedia.org/71606#c2 (christian) The webserver's config around api.wsgi is from 2014-01-20, so that should not have changed in the last two weeks. [12:00:55] mornin qchris! [12:01:02] Heya! [12:01:45] ok, so i think when i do this naming thing, I have an OO / REST background in mind [12:01:54] which is why I like noun/verb so much [12:02:38] well, maybe REST isn't really relevant here [12:02:45] but I like organizing from largest to smallest concept [12:02:52] and nouns seem to be a larger concept to me [12:03:02] Ah. naming :-) [12:03:03] Analytics / Wikimetrics: Apache's logs containing "client denied by server configuration: /srv/wikimetrics/wikimetrics/api.wsgi" - https://bugzilla.wikimedia.org/71606#c3 (christian) D'oh! The IP is not the machine's IP, but that of dynamicproxy-gateway.eqiad.wmflabs. [12:03:09] hahaa [12:03:10] (PS1) Milimetric: Style layout to fill page [analytics/dashiki] - https://gerrit.wikimedia.org/r/164536 [12:03:12] yeah i just read your review [12:03:12] (PS1) Milimetric: Style color swatches square and borderless [analytics/dashiki] - https://gerrit.wikimedia.org/r/164537 [12:03:14] (PS1) Milimetric: Style metric selector button and popover [analytics/dashiki] - https://gerrit.wikimedia.org/r/164538 [12:03:16] (PS1) Milimetric: Style border radii and project list [analytics/dashiki] - https://gerrit.wikimedia.org/r/164539 [12:03:18] hm, but you want to have big_noun/verb_yeah_yeah [12:03:51] i want big_noun/smaller_noun/even_smaller_noun_if_needed/verb [12:03:56] or not! [12:04:21] why you gotta be hatin on adjectives yo [12:04:24] i think I don't want to make a blanket policy across everything. it will be tough to define something that fits every circumstance [12:04:29] sure, adjectives! [12:04:41] :-) [12:04:50] Ok. Let's focus on /oozie then. [12:04:56] haha [12:04:59] well, i mean even within /oozie [12:05:15] Ok. [12:05:31] I am all for structure. [12:05:44] But having two abstraction levels for only [12:05:59] two oozie jobs just seems overstructured. [12:06:04] btw, i'm about to add a drop workflow [12:06:12] util/hive/partition/drop [12:06:18] Then s/two/three/ [12:06:28] util is separate for me. [12:06:31] that i want to use as a subworkflow for sqoop stuff [12:06:32] yes agree. [12:06:38] hive/partition/drop [12:06:51] webrequest/partition/add [12:07:10] hm, you say that you would like that to be check_integrity? [12:07:20] A bit. yes. [12:07:22] but, the reason that dir exists is to add a partition. [12:07:34] Originally, yes. [12:07:39] it just happens that we check and report on the partition's integrity too [12:07:43] But now the key part (at least from my point of view) [12:07:49] is that it checks integrity. [12:08:10] So ok. Partition adding is more important to you. [12:08:11] Mhmm. [12:08:22] Was only a side effect for me.
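The util workflows under discussion (hive/partition/add, hive/partition/drop) boil down to single Hive DDL statements. A minimal sketch of what they would execute, with made-up table and partition values; the refinery's actual parameterization is not shown in this log:

    -- Hypothetical sketch only; table name and partition spec are assumptions.
    ALTER TABLE webrequest ADD IF NOT EXISTS
      PARTITION (webrequest_source='text', year=2014, month=10, day=3, hour=0);

    ALTER TABLE webrequest DROP IF EXISTS
      PARTITION (webrequest_source='text', year=2014, month=10, day=3, hour=0);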
[12:08:46] that is where more of the work is in that workflow, but that is almost more a monitoring thing [12:09:06] (PS2) Milimetric: Style border radii and project list [analytics/dashiki] - https://gerrit.wikimedia.org/r/164539 [12:09:19] like, i wouldn't name the kafka puppet module after the monitoring class that it has in it [12:09:35] Unfair comparison :-D [12:09:46] But I agree. The oozie workflow does two things. [12:09:53] (agree unfair :p) [12:09:59] We have different views on which of the two is more important. [12:10:22] well, i guess, ok. I think OO is going out of style, but that is where I draw most of my influences :p [12:10:28] object.verb() [12:10:38] Ok. Let's ignore the names for a second. [12:10:47] Do you think we need that many layers of structure? [12:10:54] (Regardless of whether noun or verb) [12:11:24] I mean we totally could, but it's just that [12:11:35] ok, so, for example, you would prefer to have [12:11:39] it makes it hard for me to cd down the whole hierarchy when trying to find things. [12:11:43] hive/add_partition [12:11:43] hive/drop_partition [12:11:44] ah [12:11:54] you know, qchris, i think for me it is more concept ordering than it is hierarchy [12:11:58] so [12:11:59] in that example [12:12:21] i wouldn't mind the same hierarchy, but things named [12:12:21] hive/partition_drop [12:12:21] hive/partition_add [12:12:37] that way concepts are still grouped together [12:12:40] You're talking about util :-) [12:12:45] yes [12:12:46] util is separate for me. [12:12:48] oh ok [12:12:52] separate from naming altogether [12:12:55] ok, well that was just an example [12:13:01] I was more thinking of the webrequest directory. [12:13:03] same for non-util [12:13:15] i would be ok with webrequest/partition_add [12:13:18] and even [12:13:24] webrequest/done_flag_monitor :p [12:13:35] Hahaha :-) [12:13:38] hmm [12:13:41] ah hm [12:13:57] Ok. I see. You really, really care about the ordering. [12:14:15] If we have the ordering, we can also have the "/". [12:14:18] yeah i think so, especially in the directory names, but probably a little also in the files [12:14:41] as for the generate_tsvs thing [12:15:09] So we should change [12:15:13] webstats/generate_hourly_files [12:15:17] webstats/insert_hourly_pagecounts [12:15:19] to [12:15:22] webstats/hourly_files/generate [12:15:26] webstats/hourly_files/insert [12:15:37] (or something like that) too? [12:15:50] ya something like that (not so sure about 'hourly_files') but yeah. [12:16:06] although, in that particular example, the object is actually different there [12:16:14] webstats/hourly_files [12:16:15] webstats/table :/ [12:16:27] (I'm not suggesting that ^!) [12:16:27] My "overengineered" alarms go off pretty heavily :-) [12:16:31] But I guess you win here. [12:16:52] Next to those alarms, I have no real counter-argument. [12:17:22] wait, how'd I win? the ordering argument? [12:17:36] You wanting it really hard made you win. [12:17:39] hahahha [12:17:41] I lack hard counter-arguments. [12:17:44] awww come onnnn! [12:18:01] No really. I could only block. But with no real hard reason. [12:18:05] That'd be pointless. [12:18:16] qchris, btw, i think it is going to be really hard to obey this or any rule for all circumstances [12:18:36] True. But it should serve as a guiding principle.
[12:18:46] i like to organize things this way, and it has worked pretty well in the past, but only if you just try to stick to the convention...and yeah, just let it guide but not rule you [12:18:49] hm [12:18:54] well, what to do about tsvs? [12:19:21] Since it's hard to copy paste again and again in IRC ... [12:19:30] Let's work it out in [12:19:33] http://etherpad.wikimedia.org/p/analytics-bikeshed [12:19:35] and talk here [12:19:36] haha ok [12:20:41] Which verb to take under /webstats? [12:21:46] (PS1) Milimetric: Fix knockout for firefox [analytics/dashiki] - https://gerrit.wikimedia.org/r/164541 [12:21:58] awww, i agree we should call these pagecounts-all-sites, but webstats is so much cleaner :/ :) [12:22:00] too bad. [12:22:22] cleaner in what sense? [12:22:38] I mean ... we could keep webstats ... [12:23:12] the only reason we are choosing that name is to stick with the public convention, i think webstats is a better name aesthetically. But! I need to remember that this is a 'stop-gap' anyway [12:23:15] i shouldn't worry about it [12:23:18] anyway. thinking. [12:23:30] ok. [12:23:31] Can we somehow get the integrity check in the name of "./webrequest/partition/add"? [12:23:43] s/in/into/ [12:24:17] hashar: Hi! :-) [12:24:45] just kidding :D [12:24:59] You've got me at UUID :-) [12:25:12] watching our VPE introduction https://www.youtube.com/watch?v=GJGC9zpbJpU&feature=youtu.be :D [12:27:43] Seeing the suggestions ... I guess I do not like the "partition" in the name. [12:27:54] All of the jobs operate on/for partitions. [12:28:00] But only two have it in the name. [12:28:05] That is skew. [12:29:55] ottomata: Why would you not use a second level for pagecounts-all-sites? [12:30:14] Like something between pagecounts-all-sites and archive. [12:30:23] still thinking, reading your patch [12:31:02] k [12:34:25] qchris, still thinking, but i am coming around to your no 'partition' argument for webrequest (not for util) [12:35:02] AGH [12:35:05] why is etherpad so bad at pasting [12:35:10] So to have an example: [12:35:17] it always converts my newlines to spaces [12:35:43] A pagecounts file also exactly covers a partition. If we use "partition" in webrequest, shouldn't we also have [12:35:54] but i think i would say I'm not putting another level in, because the final level both things there are operating on is pagecounts-all-sites [12:35:55] pagecounts-all-sites/partition/dump [12:35:56] like [12:36:50] pagecounts_all_sites = PagecountsAllSites.new() [12:36:51] pagecounts_all_sites.generate(year-month-day-hour) [12:36:51] pagecounts_all_sites.archive(year-month-day-hour) [12:36:56] Wait ... your current suggestion strips a level of abstraction. I love that! [12:37:08] yes, i am saying i am coming around [12:37:30] not liking webrequest/tsvs/ yet, but maybe... [12:37:52] You also argued in the CR that "tsvs" is bad. [12:37:59] What about legacy_tsvs [12:38:02] udp2log_tsvs [12:38:12] That makes it clearer what kind of tsvs it is about. [12:38:33] oh, yes i haven't even begun thinking about the actual noun [12:38:35] haha [12:38:43] i'm still just thinking hierarchy [12:38:48] Ok. [12:38:59] I like your hierarchy. [12:39:11] The hierarchy part, I buy it as is. [12:39:20] hm [12:39:58] ok, hm, taking a step back from it, thinking about 'tsvs' [12:40:11] legacy_tsvs [12:40:11] hm [12:40:24] sure. [12:41:18] in general, i'm reluctant to name the dataset after the format [12:41:29] but, in this case i think it's ok, mainly because it is legacy [12:41:31] hm [12:41:38] Yup.
I feel the same way. [12:41:56] legacy_dump or something like that would work for me too. [12:42:08] But everyone is calling them the tsvs. [12:42:32] qchris: as for the hour, daily, etc. suggestion, i doubt we would organize the hierarchy that way, as wouldn't those concepts be abstracted in the oozie configs somehow? [12:42:45] similar to how you are abstracting the actual tsv generation by selecting a specific hive script [12:42:46] ? [12:43:00] yeah, since everyone is calling them tsvs [12:43:01] I am not sure. Maybe. [12:43:02] let's just call them that [12:43:17] yeah, i'm not sure either, meh, let's worry about that if/when it happens [12:43:33] pagecounts-all-sites is hourly by definition right now [12:43:36] Ok. I removed them. [12:44:03] qchris, how do we feel about underscore vs. hyphens here [12:44:11] pagecounts-all-sites vs legacy_tsvs [12:44:12] ? [12:44:36] On the dumps.wikimedia.org server, it's hyphens. [12:44:39] yes [12:44:45] So I figured we should use them here too. [12:44:58] just for that one dataset for historical consistency reason? [12:45:00] reasons* [12:45:00] ? [12:45:05] And IIRC you said you prefer underscore as separators. [12:45:08] i do. [12:45:28] Yes, "just for that one dataset for historical consistency reasons". [12:45:31] ok [12:45:36] ok ok let's leave that then [12:45:36] ok [12:45:51] Meh. But I am not sold on that. [12:45:52] haha, you like the structure only because there are fewer directories! [12:45:55] Underscores are fine too. [12:46:05] Yes. Fewer directories rock :-D [12:46:10] qchris: i'd be ok if the 'display name' used hyphens [12:46:23] but we used underscores everywhere else [12:46:40] * qchris does not understand 'display name' [12:46:47] You mean the directory name? [12:46:49] e.g. in dumps.wm.o [12:46:53] or name of the workflows. [12:46:54] in a 'UI' [12:46:55] oh. [12:46:59] ok. [12:47:02] if you want. :) [12:47:26] anyway, what is it about those names that you don't like? [12:47:38] i kinda like ingest. [12:47:51] Ok. Me too. [12:49:47] i don't like ingest for pagecounts-all-sites [12:49:51] HMMM [12:49:53] ergh. hm [12:49:55] hm [12:49:59] :-) [12:50:21] Because it duplicates the data? [12:50:26] ok, i kinda like ingest for webrequest, because for that case, it is making data available for the first time [12:50:42] that is not entirely true, as the real 'ingestion' is being handled by camus [12:50:59] Oh. But note the structure! It's not general ingestion. But ingestion in the scope of "pagecounts-all-sites". [12:51:06] but, from the point of view of hive (and that is our main viewpoint in hadoop right now), the data is not available at all until it has a partition added [12:51:13] haha [12:51:19] (In that scope) the data becomes available for the first time too. [12:51:22] haha [12:51:27] EEEEeeeeEEEEEE that is true. [12:51:57] Maybe .... "load" like the L in ETL? [12:51:58] ingest just has a different meaning to me though, i think. ingest, to me, implies that something was sucked in from a different system [12:52:16] hm [12:52:17] load. [12:52:25] Even that would match here. The different system is the "webrequest" table. [12:52:43] Meh. I actually prefer load. [12:52:46] Load is great. [12:53:01] nawwwwwww system is a pretty general term, but i think i wouldn't call webrequest a different 'system' [12:53:03] but, yeah [12:53:04] i think i like load [12:53:11] i'm ok with load for webrequest too. [12:53:29] i think i like archive better than dump though [12:53:57] Mhmm ... does archive describe what is happening? [12:54:03] Well. sort of.
[12:54:15] But we could archive in so many different ways! [12:54:25] true, we could dump in many different ways too! [12:54:29] are you guys dropping Limn? [12:54:31] Right. [12:54:39] ^ was for ottomata. [12:54:42] :D [12:55:00] qchris, i agree, hm, what if we had /wmf/data/dumps [12:55:02] I guess milimetric would know whether Limn is being dropped/replaced :] [12:55:04] instead of /wmf/data/archive [12:55:06] is that better? [12:55:08] not sure. [12:55:45] * qchris thinks. [12:56:31] From the table point of view, "archive" is actually good. Because the table will have its data pruned at some point in time. [12:56:52] From the action point of view, dump seems more appropriate. [12:57:04] From the people point of view "make public" seems more appropriate. [12:57:31] Archive somewhat feels "old, dusty, and boring" to me. [12:59:11] ha [12:59:25] archive is such a great word! 'legacy' feels older and dustier to me :p [12:59:34] Meh. I am ok with archive. [12:59:48] Let's stick with archive for now :-) [13:00:04] haha, both archive and 'dump' do this: they have the convenience of being both a verb and a noun [13:00:15] so we could each think of them as we like and be happy :p [13:00:21] Should we also use "Archive" for webrequest instead of legacy_tsvs? [13:00:30] hm [13:00:33] no [13:00:39] Yay. Being both noun and verb is nice. [13:00:48] Same for load :-) [13:01:10] because the legacy_tsvs are a different thing than webrequest is [13:01:17] we aren't just archiving webrequest [13:01:40] The same holds true for pagecounts-all-sites. It does "per wiki" roll-ups. [13:01:51] (PS1) Gilles: Update actions schema version and add new actions [analytics/multimedia] - https://gerrit.wikimedia.org/r/164545 [13:02:16] (CR) Gilles: [C: 2] "Tested by tunneling to the SQL server" [analytics/multimedia] - https://gerrit.wikimedia.org/r/164545 (owner: Gilles) [13:02:21] (Merged) jenkins-bot: Update actions schema version and add new actions [analytics/multimedia] - https://gerrit.wikimedia.org/r/164545 (owner: Gilles) [13:04:09] bwer? [13:04:11] oh [13:04:13] the project counts [13:04:15] hah [13:04:15] hm [13:04:27] right. [13:04:45] diifficult chooiiiiices [13:04:57] Hahaha :-) [13:05:03] webrequest/legacy_tsvs/archive? [13:05:03] haha [13:05:05] oof [13:05:09] ahhhHHHGHHG [13:05:20] Meh. I am fine with having legacy_tsvs for webrequest [13:05:25] well, it just so happens that you are archiving AND generating in the same workflow [13:05:26] and archive for pagecounts-all-sites. [13:05:35] i'm fine with it as we have it too, qchris. [13:05:49] Ok. [13:05:53] So then it is: [13:06:00] ./webrequest/load [13:06:00] ./webrequest/monitor_done_flag [13:06:00] ./webrequest/legacy_tsvs [13:06:00] ./pagecounts-all-sites/load [13:06:00] ./pagecounts-all-sites/archive [13:06:01] ? [13:07:08] Oh no. Sorry. That does not match what you suggested in the etherpad. [13:08:10] nono [13:08:12] your suggestion [13:08:14] i haven't changed mine [13:08:16] that's cool [13:08:23] there. [13:08:27] i changed mine to match yours [13:08:36] Ok. Cool. [13:08:48] I'll prepare the patches for this change over the weekend. [13:08:53] what a productive naming discussion! [13:08:57] didn't even need any UUIDs! [13:09:03] I liked it! [13:09:04] Thanks. [13:09:21] But ... UUIDs ... we could totally do that. It would be useful!
[13:10:47] (CR) QChris: [C: -1] "I will amend to make the directory" [analytics/refinery] - https://gerrit.wikimedia.org/r/162589 (owner: QChris) [13:11:35] (PS1) Gilles: Add new actions for download popup [analytics/multimedia/config] - https://gerrit.wikimedia.org/r/164546 [13:11:37] Now hashar left ... we should have used UUIDs :-( [13:12:16] (CR) Gilles: [C: 2] "Tested with local limn pointing to production tsvs" [analytics/multimedia/config] - https://gerrit.wikimedia.org/r/164546 (owner: Gilles) [13:12:41] (CR) Gilles: [V: 2] Add new actions for download popup [analytics/multimedia/config] - https://gerrit.wikimedia.org/r/164546 (owner: Gilles) [14:00:06] ottomata / nuria: I'm running standup "quickly" today [14:00:17] k [14:00:19] there's a fun saiku meeting happening right now [14:00:29] (which you're all invited to if you're interested) [14:00:56] i am in the standup ... let's see if it clicks in [14:01:39] hashar: no, Limn will be around for a minute [14:01:44] (a long time) [14:02:33] milimetric: thx. I was just being curious. [14:29:02] today i learned Kevin is your product manager. I thought he was a second ops :D [14:30:22] hashar, yes, I got a lot of commentary during his metrics presentation [14:30:23] along the lines of [14:30:36] "Who is this guy? Product manager for analytics? HE'S NOT ON HOWIE'S TEAM! HE ESCAPED!" [14:30:50] I had to point out we had several PdMs who are attached directly to their teams [14:36:54] Ironholds: sometimes that makes more sense :] [14:37:02] agreed! [14:39:09] I am watching the metrics presentation right now [14:39:33] sounds like the metrics presentation was my coming out party [14:41:30] kevinator: nice pres :-] [14:41:37] I love how the web interface is super simple [14:41:45] thank you [14:41:59] yet having a huge complicated backend behind it "just" to draw a couple lines hehe [14:42:16] Toby and I looked at each other when Erik announced he’d be using Vital Signs at the next metrics meeting [14:42:57] backend… yeah… we’re exploring data warehouses right now to improve performance [14:43:00] isn't it going to be presented by Damon? [14:43:43] I don’t know… Erik may hang on to the responsibility of presenting the metrics on the community [14:44:11] and milimetric's Graph extension showing up ( https://www.mediawiki.org/wiki/Extension:Graph ) [14:44:19] I should watch those metric meetings more often [14:45:56] and I am off, will be back later [15:04:54] tnegrin: yoyo [15:05:10] 5 mins -- just finishing up [15:06:02] k [16:01:02] (CR) Nuria: [C: 2] Style border radii and project list [analytics/dashiki] - https://gerrit.wikimedia.org/r/164539 (owner: Milimetric) [16:01:11] (CR) Nuria: [V: 2] Style border radii and project list [analytics/dashiki] - https://gerrit.wikimedia.org/r/164539 (owner: Milimetric) [16:01:25] (CR) Nuria: [C: 2 V: 2] Style metric selector button and popover [analytics/dashiki] - https://gerrit.wikimedia.org/r/164538 (owner: Milimetric) [16:01:43] (CR) Nuria: [C: 2 V: 2] Style color swatches square and borderless [analytics/dashiki] - https://gerrit.wikimedia.org/r/164537 (owner: Milimetric) [16:02:02] (CR) Nuria: [C: 2 V: 2] Style layout to fill page [analytics/dashiki] - https://gerrit.wikimedia.org/r/164536 (owner: Milimetric) [16:05:35] * Ironholds pokes ottomata [16:06:04] See the thread "Fwd: Monthly MaxMind Receipt" when you're back? [16:06:35] Ironholds: oh yeah, i'm in 1:1 w toby, will ask him about that now [16:07:58] we are talking about this now [16:08:00] Ironholds: [16:08:09] hah!
cool :) [16:08:11] thanks both [16:09:27] so this is just connection type? [16:09:40] Ironholds: join our little meeting! [16:09:40] https://plus.google.com/hangouts/_/wikimedia.org/otto-tnegrin [16:09:46] Ironholds: come hither: https://plus.google.com/hangouts/_/wikimedia.org/otto-tnegrin?authuser=0 [16:10:13] kk [16:10:15] Analytics / Wikimetrics: Apache's logs containing "client denied by server configuration: /srv/wikimetrics/wikimetrics/api.wsgi" - https://bugzilla.wikimedia.org/71606#c4 (christian) Seems the issue is triggered by pointing a browser to https://metrics.wmflabs.org/server-status [16:12:30] Analytics / Wikimetrics: Apache's logs containing "client denied by server configuration: /srv/wikimetrics/wikimetrics/api.wsgi" - https://bugzilla.wikimedia.org/71606#c5 (nuria) Then it is a lawful error, I think there is no harm on that one returning a 403. Loads of interesting things on server-status:... [16:14:34] * Ironholds dances [16:14:46] now I just need the table locks on enwiki and kowiki dealt with and we are /cooking with fire/ [16:14:53] or at least, some sort of high-temperature concept or substance [16:48:25] Ironholds: did you talk to toby about what level to purchase? [16:48:29] should I go ahead and get a year? [16:48:31] https://www.maxmind.com/en/geoip2-connection-type [16:48:34] all at once? [16:49:13] ottomata, it's cheapest! I can't see us getting bored with it. [16:49:22] It lets us delve into a really important part of how mobile readers and editors behave. [16:49:49] ok [16:50:40] ha, it saves $90 total :/ [16:50:42] sure! [16:53:23] cool! [16:53:30] poke me when it syncs and I'll have some fun :D [17:04:53] hrm. Anyone seen springle? [17:09:44] (PS1) Nuria: Fixing space issues with top metric bar [analytics/dashiki] - https://gerrit.wikimedia.org/r/164578 [17:10:54] Ironholds: it's 3 AM in Australia, so hopefully not [17:11:11] YuviPanda, fair point. Darn; I wanted some table locks handled :( [17:11:18] Ironholds: does Rachel know this is coming? [17:11:20] buying this thing? [17:11:22] I don't have CC info [17:11:25] i think she or Toby will have to buy it [17:11:29] i can get her the login info though [17:11:46] I assumed they'd bill, and that's how we'd pay [17:11:49] Rachel knows it's coming [17:11:57] but, reply to the thread so they know? [17:14:05] I can't buy it myself [17:14:11] I need to enter CC [17:15:07] gotcha [17:15:31] you know rachel /is/ on IRC, right? rfarrand. [17:15:43] yes [17:15:59] cooilio :) [17:16:04] YuviPanda, you know if desktop is blocked in china? [17:16:22] Ironholds: when I last looked, no [17:16:25] only mobile https was [17:16:30] ta [17:17:11] Ironholds: hmm, according to http://www.blockedinchina.net/?siteurl=en.m.wikipedia.org it is blocked now [17:17:19] interesting [17:17:40] Ironholds: apparently blocked [17:17:44] liangent would be able to test as well [17:18:17] * Ironholds nods [17:18:19] thankee [17:18:30] Ironholds: I asked her on #wikimedia-dev [17:25:02] Analytics / Wikimetrics: Story: WikimetricsUser creates a cohort from a metric - https://bugzilla.wikimedia.org/71614 (Kevin Leduc) NEW p:Unprio s:enhanc a:None for example: 1) take the non-aggregate output of a report (say all registered users or new active editors from foowiki in a given... [17:25:54] Analytics / Wikimetrics: Story: WikimetricsUser creates a cohort from a metric - https://bugzilla.wikimedia.org/71614#c1 (Kevin Leduc) Note: Dario created the example/description above.
[17:29:46] (CR) Milimetric: [C: 2 V: 2] Fix knockout for firefox [analytics/dashiki] - https://gerrit.wikimedia.org/r/164541 (owner: Milimetric) [17:32:20] (PS2) Milimetric: Fixing space issues with top metric bar [analytics/dashiki] - https://gerrit.wikimedia.org/r/164578 (owner: Nuria) [17:33:41] ping nuria [17:33:55] yes, in meeting with dan halfak [17:33:56] EL sendBeacon meeting [17:34:03] Is happening now [17:34:35] argh, sorry [17:34:37] coming [17:35:56] in there but i do not hear anyone [17:36:17] (PS3) Milimetric: Fixing space issues with top metric bar [analytics/dashiki] - https://gerrit.wikimedia.org/r/164578 (owner: Nuria) [17:37:49] (CR) Milimetric: [C: 2 V: 2] "just moved the css to the "our styles" section. It's not very fancy organization, but I'm trying to keep our stuff separate from tweaks t" [analytics/dashiki] - https://gerrit.wikimedia.org/r/164578 (owner: Nuria) [17:38:04] (PS3) Milimetric: Fix nondeterministic project selector [analytics/dashiki] - https://gerrit.wikimedia.org/r/163892 (https://bugzilla.wikimedia.org/71333) [17:41:53] ir [17:41:57] Ironholds: got a minute? [17:42:11] tnegrin, yep! [17:42:18] let's hangout [17:42:44] kk. Send me a link. [17:42:47] https://plus.google.com/hangouts/_/calendar/d2lraW1lZGlhLm9yZ19jYjM3bXU0OGNuaHRkN2hybmE4czI3b25hb0Bncm91cC5jYWxlbmRhci5nb29nbGUuY29t.c6j7qidqs491nhi7ovk9pi4h14 [17:46:46] Ironholds: tell tnegrin he needs to pony up with that CC if you want what you want [17:49:24] ottomata, I thought Rachel said she could do it this afternoon? [17:50:15] Analytics / Wikimetrics: Apache's logs containing "client denied by server configuration: /srv/wikimetrics/wikimetrics/api.wsgi" - https://bugzilla.wikimedia.org/71606#c6 (christian) While we could ignore them, those requests are coming very controlled and reliably. Instead of expecting two erroneous r... [17:51:14] Analytics / Wikimetrics: Story: WikimetricsUser is able to chain metrics - https://bugzilla.wikimedia.org/71614#c2 (nuria) I think this needs quite a bit of definition. Wikimetrics is not the best tool to "create" cohorts, and "ad hoc" query tool will be a lot better. We can chain metrics and run the... [17:52:14] Analytics / Wikimetrics: Story: WikimetricsUser is able to chain metrics - https://bugzilla.wikimedia.org/71614#c3 (Kevin Leduc) Here's the thread on wikimetrics-l https://lists.wikimedia.org/pipermail/wikimetrics/2014-October/000165.html The related story on chaining is here: https://bugzilla.wikimed... [18:00:59] nuria, re. the rate at which EventLogging is dropped on the server-side. I didn't know it was so common. Where do I find out more? [18:27:06] halfak, it's not common, we fixed a bug a while back, but we are capped in throughput, so if an event goes crazy and starts logging a hundred times as much [18:27:15] it affects teh rest of the events [18:27:36] *the [18:28:07] halfak: so not a common scenario but if all of a sudden someone bumps up logging rates it can happen [18:28:18] nuria, I'm familiar with that then. By the way you talked about it, I thought that you meant I should expect events to be dropped all of the time. [18:29:18] halfak: no, that is not the case, we have alarms that go off if we go over a certain leven of throughput [18:29:19] https://wikitech.wikimedia.org/wiki/Incident_documentation/20140509-EventLogging [18:29:29] this is the last incident [18:30:18] *certain level of throughput [18:30:20] Hokay.
So I should expect that dropped events outside of such incidents are either unknown issues on the server-side or they were dropped on the client-side. [18:30:24] ^ halfak [18:30:55] halfak: or events that do not validate [18:31:03] of course [18:31:10] not necessarily dropped [18:47:21] ottomata, does hadoop log byte size of an image that we send to the client? [18:47:29] for all multimedia requests [18:48:09] hmm, yes. [18:48:13] there is a response_size field [18:48:13] i think [18:48:19] so, if you find an image request [18:48:23] the response size should be the byte size [18:54:11] ottomata, and we log ALL the multimedia requests, right? [18:54:17] we log all the requests! [18:54:22] if it is an http request to varnish [18:54:22] we got it. [18:54:34] that'll probably be in the source='webrequest_upload' partition [18:54:37] right? [18:55:10] yurikR1: btw, if you are going to do some exploratory queries, it'll be best to limit your queries to a single hour, at least while you work out the query [18:55:20] ottomata, and we log ALL the multimedia requests, right? [18:55:24] you'll get results faster, as it will have to process much less data [18:55:27] i still need to find out how to do those tests :) [18:55:35] yurikR1: if it is an http request to a varnish, we log it [18:55:39] that's all I can say! [18:55:39] :) [18:55:56] so, if by 'multimedia request', you mean http request to a production site, then yes, it should be logged [18:55:57] i think we use a separate varnish to get multimedia [18:56:06] requests to uploads [18:56:06] what do you mean by multimedia? [18:56:08] yes [18:56:14] upload varnishes are logged [18:56:16] do [18:56:16] thx [18:56:34] where webrequest_source='upload', to limit your query to only requests from upload varnishes [19:14:40] thx :) [19:17:38] ottomata, another quick question - do you have some magic field that "officially" marks a request as a pageview +1 ? [19:18:35] haha [19:18:44] no way, one day we will have a UDF that does that [19:18:57] you should ask Ironholds that question and see what his reaction is :D [19:19:22] .... [19:19:25] bahahaha [19:19:56] then how can you draw graphs? [19:20:01] of pageviews [19:20:03] yeah you want CREATE TEMPORARY FUNCTION identify_pageviews as 'org.wikimedia.analytics.refinery.hive.MagicUnicornMadeOfIceCreamUDF'; [19:20:26] a large amount of filtering of the raw data? [19:20:44] yurikR1: if you are speaking of the newly generated pagecounts thing [19:20:46] this is all funny, i agree :) but really need that data for zero team :( [19:20:54] that is just old webstatscollector ported to hive [19:21:01] you could reuse that logic if you wanted to [19:21:05] yurikR1, the raw data, or...? [19:21:16] Like: the graphs should have aggregate datasets behind them. [19:21:18] https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webstats/insert_hourly_pagecounts/insert_hourly_pagecounts.hql [19:21:22] if you want more granular data, well, that's a long conversation [19:21:48] basically, out of all the requests logged, i need to get just the ones that we officially consider as a "pageview" [19:21:55] that is, the intact requests? [19:22:19] or a count of the requests? [19:22:23] yurikR1: there is no official 'pageview' definition yet. that will be Ironholds' magnum opus if people ever agree on things :) [19:22:51] indeed, but we can offer the interim definition, or at least those bits people agree on.
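A sketch of the byte-size query this exchange points at, restricted to one hour as ottomata advises. response_size and webrequest_source='upload' come from the discussion above; the year/month/day/hour partition columns and the uri_host column are assumptions:

    -- Hypothetical sketch: requests and bytes served from the upload
    -- varnishes for a single hour. Partition and column names beyond
    -- response_size and webrequest_source are assumptions.
    SELECT uri_host, COUNT(*) AS requests, SUM(response_size) AS bytes_served
    FROM webrequest
    WHERE webrequest_source = 'upload'
      AND year = 2014 AND month = 10 AND day = 3 AND hour = 14
    GROUP BY uri_host;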
[19:22:53] not sure what you mean by "intact" requests, i need a count of the requests [19:23:20] grouped by certain other params, e.g. - zero partner, or proxy (both of which we can calculate based on IP + XFF) [19:23:27] mmmn, no you can't. [19:23:28] or by http vs https [19:23:58] Hive doesn't know what the proxy list is. It also doesn't know what the ranges of zero partners are. So it's kind of hard for it to group by them right now. [19:24:30] if you want me to give you a count of yearmonthdayhour + zero MCC + count(*), that I can do. [19:24:31] how can we build that, because if we show nice graphs, and don't filter all irrelevant stuff, we have no valid count, just some relative numbers (today was more than yesterday, but that could be simply because we started loading an extra resource on all pages) [19:24:33] But it'll use the sampled logs. [19:24:45] which graphs are you talking about? [19:24:48] Ironholds, zero can create a function for that [19:25:17] but what we should agree on (even if it changes later), is some magicUnicorn algo to say if it was a pageview request or a resource request [19:25:26] (resources being images, css, js, etc) [19:25:26] er. [19:25:41] haha [19:26:03] I guess, I'm not seeing the value of us standardising right now, for all the zero things, if (1) this will not match any other definition currently in use, and (2) it will be superseded anyway. [19:26:17] if you want me to show you our draft definition I absolutely can. But bear in mind that no production system uses it. [19:26:24] and there are known weaknesses in the production definition. [19:26:36] yurikR1: I am not involved in these discussions, but that is something that everybody wants. Ironholds is working with Dario and Erik Z and who knows who else to create a standardized pageview specification [19:26:49] qchris, EZ, dario, yep. [19:27:08] don't get me wrong: we absolutely need a standardised definition. But would you like our current best guess, or the current production version of that definition? [19:27:09] excellent, so will you run it retroactively on all past data? [19:27:17] once the spec is agreed on, we can implement some stuff that will give us a Hive UDF that will let you do something like [19:27:18] well, we only have 90 days of data [19:27:26] select bla bla from webrequest where bla bla and is_pageview=1 [19:27:27] or something. [19:27:29] or [19:27:35] but if you were to ask "Oliver, do you have zero pageviews broken down by MCC over the last 16 months" the answer would be "yes". [19:27:36] best guess is probably going to be much better than my own guess :) [19:27:44] so yes, is_pageview field is what i need [19:27:49] select where is_pageview(request_uri, ip, xff, ...) = 1 [19:27:56] okay. That doesn't exist yet and won't for several months. [19:28:17] In the meantime I can show you our standardised definition if you want to implement something for zero pageviews, or just give you aggregates from the sampled logs that use a matching definition. [19:28:27] Ironholds, can we have some magic, possibly client-side, or some extra WHERE clause parts for that guess?
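In practice, the wished-for is_pageview filter would mean registering a UDF per session and then using it in a WHERE clause, roughly as below. The jar path, class name, and argument list here are all invented for illustration; as Ironholds says, no such UDF existed yet at the time of this conversation:

    -- Hypothetical sketch only: register a pageview-classifying UDF and
    -- filter with it. Jar, class, and signature are assumptions.
    ADD JAR /path/to/refinery-hive.jar;
    CREATE TEMPORARY FUNCTION is_pageview
      AS 'org.wikimedia.analytics.refinery.hive.IsPageviewUDF';

    SELECT COUNT(*) AS pageviews
    FROM webrequest
    WHERE webrequest_source = 'text'
      AND year = 2014 AND month = 10 AND day = 3 AND hour = 14
      AND is_pageview(uri_path, ip, x_forwarded_for);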
[19:28:28] *implement something on hadoop [19:28:36] yep, that would be good [19:28:40] yurikR1: you could take the where clauses from this [19:28:43] https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webstats/insert_hourly_pagecounts/insert_hourly_pagecounts.hql [19:28:44] okay, see, here's the problem ;p [19:28:49] that is what generates [19:28:50] I offered you two options, and your reply was "yes" ;p [19:28:54] http://dumps.wikimedia.org/other/pagecounts-all-sites/ [19:29:08] pick one. the hive query or the existing data source, I don't mind. But, pick. [19:29:44] OR, if you are feeling especially crazy [19:29:48] i guess i need a clearer understanding first :) 1) can we import old logs into hadoop? we have tons of them [19:29:49] build and use this: [19:29:49] https://github.com/wikimedia/kraken/tree/master/kraken-generic/src/main/java/org/wikimedia/analytics/kraken/pageview [19:29:51] for the past year [19:29:52] make a Hive UDF out of it [19:29:53] :) [19:29:59] and client-side magic, eh. We have talked about having pageid=foo added to the x_analytics field, or a pageview=1 element, but that'll be discussed more in the analytics meetup [19:30:05] yurikR1, (1) what old logs? [19:30:19] zero/ on stats1002 [19:30:26] yurikR1: you can put whatever you want into hadoop :") [19:30:34] and, if you want to use hive to process them [19:30:42] but i don't want to dup existing stuff [19:30:43] you can map a hive table on top of arbitrary (structured) data [19:30:56] yurikR1: there isn't much existing stuff right now [19:31:03] ottomata, that's why i want to sit down with you and go through all the zero needs and how we can do them [19:31:13] sounds fun! [19:31:18] oh yeah :) [19:31:35] yurikR1: you need historical stuff? [19:31:38] Okay, this is going all over the place. [19:31:42] can you just start with data that exists now? [19:31:46] https://meta.wikimedia.org/wiki/Research:Page_view is the draft "standardised" definition [19:31:51] build something, and then start generating results? [19:31:54] https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webstats/insert_hourly_pagecounts/insert_hourly_pagecounts.hql is the current production definition [19:32:16] the pageviews table in the staging db on analytics-store.eqiad.wmnet has counts using the standardised definition, broken out by country and MCC for zero. [19:32:23] *draft standardised definition, broken out [19:32:27] so yes, 1) import old data, 2) run some magic unicorn func to determine what is a pageview, 3) dynamically identify what is zero on all hadoop traffic (based on IP+XFF) 4) create summary reports [19:33:00] why would you use IP+XFF? [19:33:01] yes, i will start with using existing data before trying to import more [19:33:07] why not just use the x_analytics field? [19:33:18] because we don't mark a lot of traffic yet [19:33:25] * Ironholds blinks [19:33:25] because we still have to support LIMN graphs for zero [19:33:39] wait, there's a ton of zero traffic that doesn't have an MCC in the x_analytics field? [19:33:44] e.g. - if the partner does not support FR language, we won't mark it with X-Analytics [19:33:52] correct [19:34:01] oy. Well, this is going to throw all the draft data off. Grand. [19:34:01] we do varnish-side filtering for it [19:34:17] okay. [19:34:17] that's my point - we need to switch away from your graphs (and we already have an alternative) [19:34:25] cool!
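Ironholds' "yearmonthdayhour + zero MCC + count(*)" offer, expressed against the unsampled table, might look like the sketch below. The zero= tag in x_analytics is mentioned in this conversation, but the exact value format, the regexp, and the partition/column names are assumptions:

    -- Hypothetical sketch: hourly request counts per Zero carrier tag,
    -- pulled out of the x_analytics field. The 'zero=...' value format
    -- and all table/partition names are assumptions.
    SELECT year, month, day, hour,
           regexp_extract(x_analytics, 'zero=([^;]+)', 1) AS zero_carrier,
           COUNT(*) AS requests
    FROM webrequest
    WHERE webrequest_source = 'mobile'
      AND year = 2014 AND month = 10 AND day = 3
      AND x_analytics LIKE '%zero=%'
    GROUP BY year, month, day, hour,
             regexp_extract(x_analytics, 'zero=([^;]+)', 1);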
[19:34:28] all i need is to access the data [19:34:34] and properly process it [19:34:36] I thought you had access to the data? [19:34:42] i was using log files [19:34:54] for magic unicorn function, you can use the existing definition, linked above, or the draft standardised definition, also linked above [19:34:55] which are only for the stuff that has zero= in x-analytics [19:35:04] right [19:35:15] for dynamic identification, if you mean "as the request comes in", that's not going to happen, unless ottomata knows something important I don't. [19:35:27] plus, we want much more data - e.g. what portion of DESKTOP, or MULTIMEDIA came from zero partners [19:35:34] we don't mark that in varnish [19:35:35] define "desktop". [19:35:39] requests to the desktop site? [19:35:41] en.wikipedia.org [19:35:42] yep [19:35:45] * Ironholds nods [19:35:48] and, multimedia..? [19:35:50] yep [19:36:00] no, I mean "what does that mean". Requests for images? [19:36:01] partners need to know the whole zero. vs m. [19:36:06] yep [19:36:09] okay. [19:36:24] plus the image sizes from those logs :) [19:36:36] this way we can start analyzing data usage per partner [19:36:44] there is a lot of fun stuff we need :) [19:36:51] well, image size is (I guess) an element of the filename. Most of the time. [19:36:59] filename??? [19:37:08] i thought it was a field in the request [19:37:16] as well as content size there's also pixel size (I didn't know what you meant by "image sizes" when I hit enter) [19:37:28] response_size [19:37:32] my bigger worry there is I have no idea if zero.upload or m.upload exist. [19:37:37] ottomata? [19:37:46] they don't as far as i know [19:37:51] as uris? [19:37:53] i unno! [19:38:00] we have uploads and we have bits [19:38:04] okay. Then how are you going to identify zero multimedia requests? [19:38:43] there are no zero multimedia requests as far as we are concerned really - we will simply count them as "multimedia for that partner", and we will know if it is zero-rated based on partner configs [19:38:57] so, IP + XFF? [19:39:17] ip+xff is what defines a) partner, b) proxies, e.g. opera or nokia [19:39:35] indeed [19:39:38] and sometimes that also defines a subset of partner's ips (if they have multiple infrastructures) [19:39:42] yurikR1: upload varnishes don't set MCC based on ip+xff? [19:39:48] nope [19:39:56] why not? should they? [19:40:02] because that might throw off limn graphs :) [19:40:15] i have no idea how you filter data for them, hence didn't want to touch [19:40:17] yes they should :) [19:40:17] okay. Pick a definition, use that definition, amend it to include multimedia? Whichever definition you pick will be superseded. I'd recommend the draft one but I'm biased because I wrote it. [19:40:29] although at some point we might rely on UDFs entirely for them [19:40:29] oh, because they are 'views'? pssh, have whatever creates those just filter out upload. requests? [19:40:56] yurikR1: please do write Hive UDFs, if you please :) [19:40:59] well, not even upload. [19:41:01] MIME type filtering. [19:41:03] It is a thing. [19:41:08] nobody really yet has experience doing it here, I've done it once (this week) for oliver [19:41:11] just a quick hacky one [19:41:19] (it's an awesome quick hacky one. Geolocation!) [19:41:19] I have never worked with hadoop, so all this is new to me [19:41:27] yurikR1: java experience?
[19:41:27] oh yeah, we want that one too :) [19:41:31] yes [19:41:33] you can do it [19:41:35] it's super easy [19:41:41] it's just writing a method [19:41:43] i might as well learn hadoop internals :) [19:41:45] and then registering it with hive [19:42:00] the only hadoop-y part is the Types of the variables you can take as input and return as output [19:42:09] i understand, it's more that hooking it up to hadoop and using the whole hadoop thingy is what i need to learn, i'm sure it's not rocket science [19:42:12] (it's worse [19:42:14] http://blog.matthewrathbone.com/2013/08/10/guide-to-writing-hive-udfs.html [19:42:20] well, with hive [19:42:24] you don't really have to care [19:42:34] just glance at that page [19:42:38] the part about the simple UDF [19:42:53] ottomata, let's do all that when we discuss it and i take proper structured notes [19:42:55] ok ok :) [19:43:03] btw, there's lots I don't know either :) [19:43:08] i'm learning this stuff too [19:43:21] i hope so! i would hate to talk to a "know-it-all" ;) [19:43:33] heh [19:43:43] just some "hello world" stuff to get me started [19:44:06] ok, let's chat in a bit, need to wrap up my SF visit with all the other stuff, chat either next week, or when you get back to NY [19:44:11] Ironholds, thanks!!! [19:44:27] np [19:44:31] (and ottomata thanks to you too :))) [19:46:26] yup :) [19:50:26] just out of curiosity - why not populate is_pageview now, and later correct it? [19:51:09] yurikR1: how? [19:51:13] client side in x-analytics? [19:51:25] the raw logs that we have, have only data from varnish [19:51:33] we do want to do an 'ETL' phase [19:51:41] that keeps more logs, geocoded, anonymized, with extra goodies like that [19:51:47] but, that will be a different dataset and a different table [19:52:16] there are a lot of moving parts to that, including learning the best ways of how to store it, what format, etc. etc. [19:52:22] it hasn't been prioritized yet [20:25:18] (PS1) Ottomata: Fix usage comment in oozie/util/hive/partition/add/add_partition.hql [analytics/refinery] - https://gerrit.wikimedia.org/r/164650 [20:25:31] (CR) Ottomata: [C: 2 V: 2] Fix usage comment in oozie/util/hive/partition/add/add_partition.hql [analytics/refinery] - https://gerrit.wikimedia.org/r/164650 (owner: Ottomata) [20:28:11] (PS1) Ottomata: Add oozie util workflow to drop partitions from Hive tables [analytics/refinery] - https://gerrit.wikimedia.org/r/164653 [20:31:05] now I don't see leila at all [22:36:18] Analytics / General/Unknown: Requests for 'undefined' page increasing on wikipedias - https://bugzilla.wikimedia.org/66352#c5 (Toby Negrin) I am adding Timo and Trevor to this bug because I cannot answer any of their questions. Thanks folks. [22:39:07] ^ most honest bug report of all time. [22:39:16] tnegrin wins a point [22:41:46] Analytics / General/Unknown: Requests for 'undefined' page increasing on wikipedias - https://bugzilla.wikimedia.org/66352#c6 (nuria) Created attachment 16666 --> https://bugzilla.wikimedia.org/attachment.cgi?id=16666&action=edit [undefined] pageviews to wikipedia as of last month [undefined] pagevi...
[23:26:15] Analytics / General/Unknown: Requests for 'undefined' page increasing on wikipedias - https://bugzilla.wikimedia.org/66352#c7 (nuria) Request data can be found, for example, here: http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-10/ Or in stat1002:/a/squid/archive/sampled [23:29:45] Analytics / General/Unknown: Requests for 'undefined' page increasing on wikipedias - https://bugzilla.wikimedia.org/66352#c8 (nuria) Wikistats: http://stats.grok.se/en/201410/Undefined 'Undefined' top 24 article.