[00:18:21] (PS4) Terrrydactyl: Added delete wiki user functionality. [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/124878 [10:45:06] https://hacks.mozilla.org/2014/05/reconciling-mozillas-mission-and-w3c-eme/ [10:51:29] (CR) Nuria: "Preliminary CR. We shall touch base once we know we are going ahead with the feature. There is still work needed regarding UX, UI, except" (7 comments) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/133091 (owner: Terrrydactyl) [10:55:03] (CR) Nuria: "Also, one other thing. Alembic downgrade is not working." [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/133091 (owner: Terrrydactyl) [11:08:22] (CR) Nuria: "I think this approach is not the best as it is not very likely you can match users freely typed on search bar on this fashion. Also this " [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/124878 (owner: Terrrydactyl) [13:31:48] hey all, i have to move locations, i should be back on by the standup, hoping my paris bike navigation skills do not fail me!@ [14:00:03] ERghghhh [14:00:04] i am online, maybe [14:01:20] ok milimetric et. al., moving up to office, my friend found me I am not lost [14:01:25] back on asap [16:40:50] hey folks [16:40:56] have you seen nemo's e-mail and my reply? [16:41:45] nuria: did you reply to the right thread? :) [16:42:00] ay .... ay .... [16:44:23] ori: well, i think so. i thought nemo wanted to know if we had done anything on anonymization front of EL data so i send him an update from what we talked about last ...but maybe i missed the point entirely [16:45:00] yeah, the point is that the rate of events has spiked massively [16:45:11] is there a graph that lists mediaviewer EL explicitily or you just looked at vanadium logs? 
[16:45:13] operational implications first, moral implications later [16:45:22] so, i was about to say [16:45:33] i have a script hacked together, would be good to add it to the repo somewhere [16:45:36] let me put it up on gist [16:45:40] ok [16:46:20] * milimetric catches up with email [16:47:21] ah, interesting, though only one order of magnitude increase - is this alarming from a storage point of view? [16:47:27] the rest of the architecture should hold up fine [16:47:52] ah yes.... :) 20 million rows a day seems bad :) [16:49:51] ori: is there a way (besides looking at /var/log/eventlogging# tail -f client-side-events.log ) to see the logging "per event"? [16:50:32] https://gist.github.com/atdt/8deed4bc2d311ba0122f#file-el-status-py [16:50:40] who knows where the hive DB drivers live on stat2? [16:50:43] try running that on vanadium [16:56:07] * Nemo_bis didn't mean to cause havoc [16:56:30] Nemo_bis: it's great that you noticed that, i really appreciate the heads up [16:56:57] milimetric, nuria: ping? [16:57:06] okeis i see super hight tech version on tail -f [16:57:18] *of "my tail-f" [16:57:48] but the action is to tell mediateam right? [16:58:02] yes, but we know what the fix is [16:58:07] we cannot page them [16:58:08] i'm not allowed into vanadium :( [16:58:14] do you know what controls the rate of events for media stuff? [16:58:35] marktraceur: ^ [16:58:35] milimetric: if you edit the script and change 'localhost' to 'vanadium.eqiad.wmnet' you should be able to run that from anywhere on the cluster [16:58:37] I do not [16:58:44] there's a config var [16:59:07] don't ping marktraceur, i'm giving you clues :P [16:59:17] but... we're changing his stuff [16:59:26] i mean, he should at least know :) [17:00:02] do you have mediawiki-config cloned? 
[17:00:27] any time you need to know something about how mediawiki or one of its extensions are configured you should grep that repo [17:00:39] no ori, we never work with mediawiki code [17:00:51] I rather think teh action should be 1. turn of consumer [17:01:00] 2. notify media folks [17:01:14] well, i notified them by CCing the thread to multimedia [17:01:14] 3. turn consumer up when issue is solved [17:01:18] and pinging gilles [17:01:26] but turning off the consumer will stop all events [17:01:31] so i don't see how that's a solution [17:01:32] i can clone it real fast though [17:01:35] we can fix this the proper way [17:01:38] so clone it :P [17:01:53] cloning [17:01:58] nuria: git clone ssh://milimetric@gerrit.wikimedia.org:29418/operations/mediawiki-config [17:02:10] (not milimetric of course) [17:03:47] ori: ok cloned [17:03:49] k, mediaviewer wise I see the dblist and a few settings entries [17:05:05] nuria do you know what we're looking for here? :) [17:05:14] no, i do not [17:06:15] the block of config for an extension is usually nested in an if() [17:06:25] with if ( $wmgUseMyExtensionName ) [17:06:30] in the case of media viewer, wmgUseMultimediaViewer [17:07:06] in line 1824 of CommonSettings.php [17:07:11] if ( $wmgUseMultimediaViewer ) { [17:07:11] require_once( "$IP/extensions/MultimediaViewer/MultimediaViewer.php" ); [17:07:11] $wgNetworkPerformanceSamplingFactor = $wmgNetworkPerformanceSamplingFactor; [17:07:23] ah, sampling factor [17:07:24] cool [17:07:39] * marktraceur watches silently [17:07:40] wmgNetworkPerformanceSamplingFactor is defined per-wiki in wmf-config/InitialiseSettings.php:10808 [17:08:06] 'wmgNetworkPerformanceSamplingFactor' => array( [17:08:06] 'default' => 10, // beta feature users do not generate enough data with 1:1000 [17:08:06] 'mediaviewer' => 1000, [17:08:08] ), [17:08:47] so let's add a zero to each of those numbers [17:08:47] since gerco has responded to the e-mail thread perhaps he can do needed chnages? 
I am no fond of changing code i cannot test. [17:08:55] what's "default" though? [17:08:57] *changes [17:08:57] ah let me read his reply [17:09:11] milimetric: the value for all wikis not explicitly configured by name [17:09:45] all... extensions you mean? [17:09:55] 'mediaviewer' sounds like an extension [17:10:12] all wikis [17:11:36] hm... the comments in this config doc are confusing... [17:12:36] i asked tgr to join the channel so while we wait i'll explain [17:12:50] mediawiki and extensions are basically configured using global vars [17:13:04] there's an effort to improve that since it's not a particularly elegant architecture, but that's besides the point right now [17:13:11] right, i saw at the summit [17:13:16] but there are different configs for different wikis [17:13:17] so these are the globals they were talking about, ok [17:13:23] yet they all run on the same hosts [17:13:33] so basically you need a way to specify what the value should be for each wiki [17:13:49] got it, and "default" is that way [17:13:59] but how is "mediaviewer" a wiki? [17:14:07] i don't know, that is rather bizarre [17:14:15] :) [17:14:25] hi tgr [17:14:28] hey tgr, thanks [17:14:30] hi [17:14:51] we are logging about 10M events a day, I think [17:15:01] and have not deployed to enwiki yet [17:15:50] do you have a number we should reduce our daily log count to? [17:15:52] let's increase the sampling factor [17:16:10] no, and you're right that it's an oversight [17:16:27] there aren't any clear-cut guidelines [17:16:50] yeah, we probably need something other than mysql to handle that kind of traffic though [17:16:54] but let's up the factor by a factor of 10 if you don't mind [17:16:58] so the sampling factor is 1:1 now [17:17:08] milimetric: we're using tokudb now and it's handling the load quite nicely [17:17:20] ok, cool [17:17:35] tgr: is it useful to have every single datapoint? 
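The per-wiki lookup described above can be sketched roughly as follows. This is a minimal illustration in Python, not MediaWiki's actual resolution code: the `resolve` helper, the `GROUPS` table and the wiki name `examplewiki` are hypothetical, and it assumes (as is confirmed later in the discussion) that `mediaviewer` names a group of wikis rather than a single wiki. The two values are the ones quoted from InitialiseSettings.php.

```python
# Hypothetical sketch of "one value per wiki, with a default" lookups.
# Only the setting name and the two numbers come from the log above.
SETTINGS = {
    "wmgNetworkPerformanceSamplingFactor": {
        "default": 10,
        "mediaviewer": 1000,
    },
}

# Assumed group membership, invented for illustration.
GROUPS = {
    "mediaviewer": {"examplewiki"},
}


def resolve(setting, wiki):
    """Return the value of `setting` for `wiki`.

    Order assumed here: exact wiki name, then any group the wiki
    belongs to, then the 'default' entry.
    """
    values = SETTINGS[setting]
    if wiki in values:
        return values[wiki]
    for group, members in GROUPS.items():
        if wiki in members and group in values:
            return values[group]
    return values["default"]
```

With this sketch, a wiki in the `mediaviewer` group resolves to 1000 and every other wiki falls back to 10.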
[17:17:37] we are using 1:1000 for network logging which would be dozens of queries per page view [17:17:46] we could go with that [17:17:58] it is certainly useful :) [17:18:11] how useful is another matter [17:18:15] i sort of assumed from the graphs that it wasn't intentional to log at this rate [17:18:25] as I said in the mail, we don't actually use the MediaViewer schema much [17:18:42] for decisions or anything [17:18:48] okay, so factor of 10 sounds right [17:19:06] it looks like everything is keeping up, so we're not actually hitting an operational limit [17:19:15] but the volume is higher than what we're used to, which means it needs to be watched closely [17:19:16] mmm, if you're not using it much at all, maybe even 1/100[0] is fine for now [17:19:18] and it's basically a daily count, sample by X, multiply results by X, end result should be all right [17:19:29] but we should also think whether we can effectively analyze 20 million data points per day [17:19:45] nuria: yes, very valid point [17:19:50] the duration logging schema is a new thing, we don't have any data processing for it yet [17:20:16] tgr: ok, so 1:100 ? [17:20:33] but we'll probably use it for percentiles and geometric means, I'm not sure how unbiased those would be with sampling [17:20:55] ori: sure [17:21:17] could you submit a patch, and i'll check with greg-g that it's ok to sync it? [17:21:40] is this a short-term fix or a long-term status quo? 
because if it is long term, I might have complaints :) [17:22:06] i think we need a long term decision here [17:22:08] not specifically for multimedia logging, I mean, it just seems to be the wrong side of things to optimize on [17:22:44] tgr: if there's a credible need for unsampled data, we can log at that rate [17:22:50] but credible need is a prerequisite [17:23:09] also we must be able to analyze at that rate [17:23:10] it doesn't seem especially justified to hoard data when you don't have a concrete use-case [17:23:17] fair enough [17:23:24] yeah, logging is not free basically, so we should be conservative with it as we are with other resources [17:23:47] milimetric: any guidance about how sampling will affect percentiles/geometric means? [17:24:10] i think halfak might have something to say about that, my stats knowledge stopped growing in high school :) [17:24:27] or lzia ^ [17:25:10] I just logged back in and I missed some conversations, milimetric. [17:25:10] ori: if we were to change the sampling rate, it seems like I'm missing one of the places where the 1:1 logging is configured [17:25:22] What's the question? [17:25:30] I imagine percentiles would not change other than getting more noisy since tha ratio of below/above percentile limit events would be the same? [17:25:37] no prob lzia, so if you have a few minutes, the question is: how will samping 1/X affect geometric means and percentiles? [17:25:43] but I have no clue about geometric means [17:25:44] tgr correct [17:26:20] we're talking about reducing the rate at which the multimedia viewer is logging information [17:26:24] but you do not need hundreds of millions of data points to have pretty trustworthy calculations [17:28:11] depends what you want to do with them and how much data you have, milimetric. [17:28:21] what do you want to use the data for? [17:28:41] so tgr, correct me if I'm wrong but this is the data that backs your network performance visualizations? 
[17:28:56] milimetric: yes [17:28:59] so for example: multimedia-metrics.wmflabs.org/dashboards/mmv#geographical_network_performance-graphs-tab [17:29:01] oops [17:29:02] http://multimedia-metrics.wmflabs.org/dashboards/mmv#geographical_network_performance-graphs-tab [17:29:11] lzia: ^ [17:29:43] milimetric: the amount of data is total overkill for those graphs [17:29:44] so one of those graphs is using ~ 10M datapoints per day to create geometric means for each country [17:30:01] well... tgr, I'm not sure it's so clear cut [17:30:02] but can be useful if we need to identify problems with rare browsers, for example [17:30:25] milimetric: no, the network graphs already use 1:1000 sampling [17:30:41] I'd expect there to be a need to sample at a rate higher than the rate at which the least participant countries generate data [17:30:44] the activity charts are the culprit in the traffic [17:30:51] the first tab in your link [17:30:54] ah ok [17:30:56] tgr: we have the green light from greg-g to sync a change [17:31:00] tgr: do you have a patch? [17:31:17] ori: I think we're still figuring out how much is ok to throttle back [17:31:21] oh sorry [17:31:34] please discuss, there's no pants-on-fire urgency [17:31:45] i got distracted because multitasking [17:31:47] so, I guess, tgr, what's an example where you are afraid sampling at a lower rate will hurt [17:32:06] milimetric: I don't care terribly much about the activity data, we never found much use for it [17:32:11] k [17:32:20] the duration data is more important [17:32:38] that's new, we don't have graphs yet, I think we deployed it yesterday [17:32:40] so maybe if you could explain how / where you're using that data for lzia to be able to guide us [17:32:53] time from click on thumbnail to display of lightbox [17:33:14] right now we would just make charts per wiki, and maybe per country [17:33:15] gotcha, so you're tracking that at 1:1 right now? 
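The question raised above (how 1/X sampling affects counts, percentiles and geometric means) can be explored with a small simulation. This is only a sketch under stated assumptions: the durations are synthetic log-normal values rather than real MediaViewer data, the helper names are invented, and the percentile uses the simple nearest-rank definition.

```python
import math
import random


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list (p between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def sample(values, factor, rng):
    """Keep each value independently with probability 1/factor."""
    return [v for v in values if rng.random() < 1 / factor]


rng = random.Random(42)
# Synthetic "durations"; the distribution is an assumption for illustration.
full = [rng.lognormvariate(6, 1) for _ in range(200_000)]
sampled = sample(full, 100, rng)

# Counts have to be scaled back up by the sampling factor.
estimated_count = len(sampled) * 100

# Percentiles and geometric means are computed on the sample directly,
# with no rescaling; they only become noisier as the sample shrinks.
full_median, sampled_median = percentile(full, 50), percentile(sampled, 50)
full_gmean, sampled_gmean = geometric_mean(full), geometric_mean(sampled)
```

Comparing the `full_*` and `sampled_*` values for different factors gives a feel for how much noise a given sampling rate introduces, which is the same comparison suggested later in the channel using real data from two similar days.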
[17:33:21] yes [17:33:45] in the future, it could be useful to be able to filter on rare events [17:33:57] like, loading times in IE6 [17:34:09] (this is really in the future as we don't even support IE6 yet) [17:34:20] our % of IE6 req is tiny [17:34:30] If that's the case, don't worry about those rare events right now, tgr. [17:34:33] but i get it , sure [17:34:36] :) there's some suspicion that IE6 has a lot of usage in China, etc. [17:34:39] damn thing won't die :) [17:35:05] so saying we should just reduce the sampling ratio when we actually have a use case is fair, I guess :) [17:35:05] If you think you will need to check them out later, you can sample them fully at a later point for a short period of time to get a sense? [17:35:17] yeah, agreed with lzia [17:36:03] as for the countries, maybe we could modify the sampling ratio for countries/wikis with small traffic? [17:36:18] tgr, it sounds like you're saying: there are valid reasons why you might want to log unsampled data, and the analytics infrastructure should handle the load generated by that gracefully. i think everyone agrees with both points. [17:36:53] that would be tricky to analyze tgr, different sampling rates - seems like maybe you'd want different schemas [17:37:30] but ok, for now, it sounds like we can turn down the firehose. So, how do you guys do this? Sync to prod and pray it works? [17:38:19] I'll make a patch, sec [17:38:31] milimetric: why pray? [17:38:51] i.e., what about this seems precarious? [17:39:07] tgr: I can look at a few sources to see if I find a rough estimate of how many you need to sample. [17:39:07] ori: typos? 
:) [17:39:10] I was telling lzia that the sampling rates we have used for percentiles before were 10.000 samples to calculate the 90th, 100.000 for the 99th and 1mil for the 99.9th, for performance [17:39:21] metrics [17:39:25] until then, what nuria is suggesting may be something you want to go with [17:39:41] My gut feeling is that you don't need all the data [17:40:10] milimetric, nuria: do you guys have RT accts? [17:40:15] You can also test this by comparing the same map for two similar weeks or days, one sampled fully, and one sampled with the estimate nuria suggests. [17:40:30] see if you find any difference which justifies logging everything. [17:41:05] I do not have RT, ori, every time they give it to me it goes away after a bit [17:41:13] We always reported 99.9 and 99.99 so sometimes we had to go to 10 million but for performance 90, 50 and 99 give you pretty much all you need. [17:41:20] ori: i do [17:41:40] nuria: yeah, 99 is the highest we use [17:41:57] then 100.000 per item reported [17:42:18] should be plenty , as in if you report distinctly per browser [17:42:31] ~100.000 per browser [17:42:53] well, so 1/1000 sounds fine, since nobody logs less than that [17:42:59] normally we would log at a high rate for a very brief time, to estimate logging rates [17:43:35] and later lower sampling [17:44:11] ori: what should i do in rt [17:44:38] nuria: i'll explain in a moment, let's sort out the sampling factor question first [17:44:47] so ori, from your script before: [17:44:47] MediaViewer 61.78% (170.75/sec) [17:44:47] MultimediaViewerDuration 13.42% (37.08/sec) [17:44:57] * ori nods [17:44:59] sounds like we can turn both to 1/1000 [17:45:27] well, nuria's saying 1/100.000 (unless I misunderstood), but 1/1000 is the lowest anyone does so that seems fine [17:45:34] don't ask me, i skipped all my math classes in high school [17:45:34] no [17:45:53] i am saying that the dataset you are analyzing to report 90th percentile [17:46:02] operationally i can tell you that 1:1000 and 
even 1:100 are totally fine [17:46:07] has to have at least 100.000 points [17:46:22] thus if you report daily [17:46:28] Oh sorry, I forgot to watch silently. Are we causing too much load on the EL server? [17:46:33] actions [17:46:40] 100.000 per action per day [17:46:56] yes marktraceur [17:47:02] Hopefully now is more clear, sorry [17:47:12] well, not too "much" but maybe "unnecessary" [17:47:26] let's put it to 1/1000 [17:47:29] ok, so to get down to 100K per day from 10M, we can do 1/100 [17:47:32] 'kay - am I understanding that we'll see a patch soon? [17:47:36] see sample size [17:47:37] 1/1000 would put us down to 10K per day [17:47:41] and we can adjust as needed [17:47:42] marktraceur: i certainly hope so :) [17:47:42] yes marktraceur, patch soon [17:47:46] sounds sane to me [17:47:51] nuria: i'll go with 1:1000 for now, and follow up on-list [17:47:52] woot [17:47:58] ok tgr, cool, thanks [17:48:01] tgr: thanks [17:48:11] I realized it's more complicated for user actions since some of them are rare [17:48:48] right, then sampling rates have to be per action [17:48:48] there are millions of lightbox open events but only thousands of check author events, etc [17:49:13] you can have a filter client side that further reduces rates [17:49:25] it sounds like you might only need to sample open events [17:49:28] and "increase" the "base" sampling rate [17:50:32] tgr: it's also ok to log everything unsampled if there is a credible need. we have the infrastructure to handle it. but we try to formulate questions first and have an analysis plan in place as a precondition for accumulating data rather than the reverse. at least with eventlogging data. [17:52:30] nuria, milimetric: i pinged mark b. 
about procuring additional nodes for EL since vanadium is a SPOF at the moment [17:52:42] he was in favor but said ideally we'd wait for the new data center [17:52:47] which will happen in a month or two [17:52:49] "SOPF" -> nuria has to google this [17:52:52] cool, sounds good ori [17:52:57] single point of failure [17:52:58] nuria: single point of failure [17:53:08] he asked to file a ticket to track it, so i filed #7509 [17:53:12] you guys are both CC'd [17:53:38] aahhh, see being dislexic is real hard [17:53:41] you can add qchris or whatever, i'm not sure who the contact person is [17:54:34] up to you guys. it doesn't require any action -- the next step is for ops to reply in a month or two when the dc is up [17:55:25] someone should also look at the graph nemo linked to once a day [17:55:42] someone other than / in addition to Nemo_bis, i mean :) [17:56:24] the rate of logged events is also reported to graphite, so perhaps we should add a dashboard to gdash [17:56:31] i can probably do that [17:56:35] sounds to me we need to use your script and pipe it to a monitoring something that looks at rates [17:56:44] and alarms [17:56:44] that's already happening [17:56:54] well, that graph is basically that [17:56:56] nuria: go to http://graphite.wikimedia.org/, expand eventlogging [17:57:10] under 'schema' you have per-schema rates [17:57:17] thanks ori, and thanks Nemo_bis for the report, much appreciated [17:57:25] but wait ...that is not an alarm right? that is a graph [17:57:30] or is there an alarm too? [17:57:32] no, not an alarm [17:57:40] ori - I'll set an alarm for myself to look at this once a day. [17:57:41] that's a good idea, tho [17:58:02] doesn't icinga do this sort of thing? 
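The gap between "a graph" and "an alarm" discussed above comes down to comparing per-schema rates against a limit. A minimal sketch of that check follows. It is not how graphite or icinga are actually configured: both function names and the 100 events/sec limit are hypothetical, and the counts are chosen only so that they reproduce the two rates quoted from the status script earlier in the log.

```python
def rates_per_second(counts, window_seconds):
    """Turn per-schema event counts over a window into events per second."""
    return {schema: n / window_seconds for schema, n in counts.items()}


def over_threshold(rates, limit):
    """Return the schemas whose rate exceeds `limit`, sorted by name."""
    return sorted(schema for schema, rate in rates.items() if rate > limit)


# Counts over a 100-second window, matching the rates quoted earlier
# (170.75/sec and 37.08/sec).
counts = {"MediaViewer": 17075, "MultimediaViewerDuration": 3708}
rates = rates_per_second(counts, 100)

# The limit is an arbitrary example value, not an agreed threshold.
noisy = over_threshold(rates, 100)
```

A real alert would read the rates that are already reported to graphite instead of recomputing them, and the threshold would have to be agreed on first.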
[17:58:10] Good [17:58:14] like can't we configure it to yell at us if we go over 200 events/second or something [17:58:17] yeah, we have icinga alerts for eventlogging services, but not for the volume of events [17:58:23] right [17:58:25] but yeah, i'd ask _joe_ about that [17:58:32] he recently updated the anomaly detection graphite alerting scripts [17:58:41] who's joe? [17:58:45] he's a new ops person, giueseppe [17:58:49] very smart and very nice [17:58:50] need to leave my cooking space [17:59:00] nuria: thanks for your help with this! [17:59:02] cool, thanks ori, I'll ping him [17:59:17] milimetric: thank you too obviously, was just saying so to nuria 'cause she's leaving [17:59:24] ditto tgr and lzia [17:59:38] and nemo for flagging [17:59:53] np, i'm happy to make simple changes slow and laborious for my own benefit anytime :) [18:00:04] tgr: patch? [18:00:56] ori: https://gerrit.wikimedia.org/r/133750 is the operations half [18:01:07] the other half is coming up [18:01:22] tgr: do they need to be synced in a particular order? [18:01:28] no [18:01:41] tgr: cool, is it ok if i sync the operations half now? greg-g okay'd it [18:01:52] sure [18:01:59] thanks, doing so [18:01:59] so tgr, whereas other settings in here break things down by wiki, you're breaking them down by extension? [18:02:31] milimetric: I am not sure I understand what you are referring to [18:02:52] your patch above, I'm trying to understand it so in the future you can just be like: hey, analytics drone, change my sampling! [18:02:55] and I can know what to do [18:03:47] tgr: the keys of config values in InitialiseSettings.php typically correspond to wiki names or group names [18:03:53] is mediaviewer a group? [18:04:29] so like, the wmgMediaViewerSamplingFactor dictionary has 'default' and 'mediaviewer' as the keys... 
yeah, what ori said :) [18:04:30] ori, yes, wikis which have it enabled by default [18:04:41] got it [18:05:18] milimetric: so we have two sampling settings now [18:05:32] yep, i see, 1/100 and 1/1000 [18:05:42] for all the wikis that are in the mediaviewer group [18:05:47] wgNetworkPerformanceSamplingFactor is in the MediaViewer config for historical reasons but we plan to move the whole thing to core [18:06:06] basically you give it an AJAX request and it logs interesting data about the request [18:06:36] damn, forgot to change that 100 to 1000 per the end of the discussion [18:06:44] did you merge it already, ori? [18:07:07] tgr: yes, i can sync a follow-up patch tho [18:08:14] https://gerrit.wikimedia.org/r/133756 [18:09:20] milimetric: ...the rest of the logging collects data that is generated specifically by MediaViewer so I figured I would create an extension-specific variable for it [18:09:44] makes sense [18:09:44] tgr: about to sync your follow-up, is there another patch? [18:09:47] about default/mediaviewer, that's just a way of keeping track of beta status [18:09:51] you mentioned "the other half is coming up" [18:10:17] when it is in beta, 1:1000 sampling is too high to get useful results [18:10:32] ori: not yet, I am talking too much and coding too little [18:10:44] sorry, that's my bad [18:12:48] milimetric: https://gerrit.wikimedia.org/r/#/c/133668/ cld use a review :) [18:13:21] milimetric, any idea which port hive uses? [18:14:13] Ironholds: the internets are saying 10000 but i don't know if ottomata changed it, and he's still in Paris [18:14:25] aha [18:14:26] thanks! [18:14:26] try 10000? but wait, why do you need the port, oh for your R stuff [18:14:29] yep [18:14:37] Unfortunately it responds by warning me and then timing out [18:14:51] and I can't tell if that's the hey-I-don't-exist-here timing out or something else. 
[18:20:07] ori: I read it and I'm enabling it in vagrant to try to run the tests [18:20:26] right now I don't get why the invalid event should return true from the validate call :) [18:20:27] milimetric: hey, awesome, thanks! [18:20:43] milimetric: if all properties are optional then an empty event is valid [18:21:01] the question is not re: validation but serialization [18:21:39] php can't distinguish between an empty associative array and an empty indexed array, they're both just array() [18:21:41] maybe it's just a naming nitpick, instead of INVALID_EMPTY_EVENT should it be WEIRD_EMPTY_EVENT? [18:21:54] right, no that's in the code change, that part's cool [18:21:57] I mean the test [18:22:09] oh whoa! [18:22:14] i didn't notice the test at all! [18:22:16] nuria wrote it [18:22:19] that's awesome! [18:23:07] " Could not establish connection to jdbc:hive2://192.168.0.1:10000/default: java.net.ConnectException: Connection timed out" bah [18:23:26] milimetric: no preference either way re the name [18:23:38] seems a bit confusing, I'll comment but +1 [18:23:51] ori: but so now I have this checked out, how would I run the tests? [18:26:56] oh, i get it, i'll continue the conversation there and ask nuria for help, thanks [18:27:33] milimetric: the phpunit setu p changed recently and i'm not sure [18:27:35] was just checking for you [18:28:12] but i wasn't being sarcastic or anything, i really hadn't noticed the test that nuria added [18:32:08] bbiaf [18:57:38] ori: mind waiting an hour or so? 
I submitted the patch but it is non-trivial and everybody it out to lunch [18:57:59] MediaViewer being broken on current master due to a core bug didn't help either [18:59:08] ori: in case it can't, the patch is https://gerrit.wikimedia.org/r/133766 [20:12:15] tgr: i don't mind waiting at all [20:12:17] ping me when it's ready [20:17:10] (PS1) Milimetric: Move wikimetrics database models to models/storage [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/133822 [20:33:10] milimetric, you still around? [20:33:31] yes and that was very creepy Ironholds, because I'm watching you right now [20:33:32] :) [20:33:36] hah! [20:33:37] (in the stream with Lila) [20:33:40] yep [20:33:46] can you access the hive dashboard, a la http://analytics1010.eqiad.wmnet:8088/taskdetails.jsp?jobid=job_1387838787660_10050&tipid=task_1387838787660_10050_m_000000 ? [20:33:47] but yea, i pause [20:33:55] I've been trying Otto's suggested workflow on wikitech, and it doesn't work. [20:34:03] I'm pretty much entirely blocked on...everything, because of that. [20:34:15] oh no, that sounds bad [20:34:22] ok, so you're trying to use ktunnel and all that? [20:34:31] i haven't done that in ages, let's see... [20:34:33] no, I went for the ssh -N -D with FoxyProxy version [20:36:14] uh, Ironholds, can you link me to that? [20:36:25] I'll rant about how I hate finding content on-wikis later [20:36:30] https://wikitech.wikimedia.org/wiki/Analytics/Kraken/Access [20:37:00] ugh, it was https://wikitech.wikimedia.org/wiki/Analytics/Kraken vs https://wikitech.wikimedia.org/wiki/Analytics/Kraken/ [20:37:02] how dumb... [20:40:43] ok Ironholds, so I was able to hit it through ktunnel [20:40:52] let's give that a shot [20:41:07] hmn, okay. thanks! [20:41:16] what's your SSH config look like? Did you copy my example a while back and have a "bastprod"? [20:41:19] what exactly do I do with the script, just run it on my local? 
[20:41:21] yep [20:41:23] ok [20:41:31] slightly different syntax, but same principle. [20:41:32] then, forget ktunnel and do this: [20:41:33] ssh -N bastprod -L 8088:analytics1010.eqiad.wmnet:8088 [20:41:41] aha [20:41:43] and then [20:41:44] http://analytics1010.eqiad.wmnet:8088/cluster [20:41:55] and let me know if that does/doesn't work [20:43:04] lesse.. [20:44:12] milimetric, okay, should I be doing anything with the browser proxy after ssh -N bastprod etc, etc? [20:44:38] no, i don't have any proxies set up [20:45:24] if you have access to analytics1010 through bastprod (ie if you can do "ssh -A bastprod" and from there "ssh analytics1010....") then it should just work [20:47:16] Ironholds: sorry, forgot to ping ^ [20:47:33] so, no luck? [20:47:54] nope. I'll try NOT using the proxy. [20:47:56] 'angon [20:48:10] no dice. this is weird. [20:50:03] milimetric, in the short term, could you look at the example I linked above and tell me what it said? ;p [20:50:10] I'm totally blocked until I find that out [20:50:31] Ironholds: that's the weird thing, I couldn't load up that job [20:50:37] ehhwha? [20:50:41] like, it 404d, or..? [20:50:45] yes [20:50:47] 404 [20:50:48] oh, you too. [20:50:56] it 404d through curl ON analytics1010, too. [20:51:04] oooh [20:51:06] :) [20:51:07] oops [20:51:12] oopd? [20:51:28] bad debugging by me, ok, so it sounds like you spun up a job and it disappeared? [20:51:42] spun up a job, died, that above is the link to the job report it gave me [20:52:02] which I cannot access and which 404s internally [20:52:31] I see this: http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_10050 [20:52:32] so hell, maybe the proxy /is/ working ;p [20:52:39] right [20:52:47] can you hit just /cluster through the proxy? [20:52:51] and/or the link I just pasted? [20:53:31] not the link, certainly [20:53:40] although curl confirms something is there claiming I killed the process, which...what? 
[20:53:45] because I sure as hell didn't ;p [20:54:23] it must've died somehow, hold on, with the way I'm doing it I have to keep guessing what other ports to tunnel [20:54:48] huurm. [20:57:27] ah, god, the way debugging works is nuts on these things [20:57:42] Ironholds: what's the query you were running [20:57:58] does it hang for like a few hours and then die? or can we just retry it easily? [20:58:11] in the job page I see "SELECT * FROM ( SELECT dt,ip...5000000(Stage-1)" [20:58:50] set hive.mapred.mode = nonstrict; [20:58:50] ADD JAR /usr/lib/hcatalog/share/hcatalog/hcatalog-core-0.5.0-cdh4.3.1.jar; [20:58:50] SELECT * [20:58:50] FROM ( [20:58:50] SELECT dt,ip AS IP FROM wmf.webrequest_text [20:58:51] WHERE year = 2014 [20:58:53] AND month = 05 [20:58:55] AND content_type RLIKE('text/html') [20:58:57] AND ip NOT RLIKE(':') [20:59:01] ORDER BY rand()) dtretrieve [20:59:03] LIMIT 5000000; [20:59:05] easy retries; it dies immediately. [21:01:51] hm, Ironholds, what you pasted would have a syntax error because you need the limit inside the dtretrieve subquery [21:01:59] since hive doesn't let you order if you don't limit [21:02:14] is that what you meant to write? limit inside there? [21:09:53] ori: https://gerrit.wikimedia.org/r/#/c/133829/ https://gerrit.wikimedia.org/r/#/c/133835/ [21:10:14] milimetric, actually, it does [21:10:21] if you set a nonstrict mode [21:10:22] (I think? I'll test) [21:10:27] ah ok [21:10:28] sorry [21:10:34] missed that : [21:10:48] yeah, it does. [21:10:54] or at least, has in the past. [21:11:06] the LIMIT being outside the subquery is necessary to be performing random selection [21:11:10] otherwise I'm selecting the first 5mil items in a random order. [21:11:25] right [21:14:05] tgr: maybe mw.config.get( 'wgMultimediaViewer', {} ).samplingFactor ? 
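The point made above about where the LIMIT sits can be illustrated outside Hive. The sketch below is plain Python over an in-memory list, so it says nothing about how Hive executes the query; the two function names are invented. It only shows why limiting before shuffling yields the first N rows in random order, while shuffling before limiting yields a random selection.

```python
import random


def limit_then_shuffle(rows, n, rng):
    """LIMIT inside, random ORDER BY outside: first n rows, reordered."""
    head = rows[:n]
    rng.shuffle(head)
    return head


def shuffle_then_limit(rows, n, rng):
    """Random ORDER BY inside, LIMIT outside: n rows drawn from all rows."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    return shuffled[:n]


rows = list(range(1000))
first_ten_reordered = limit_then_shuffle(rows, 10, random.Random(0))
random_ten = shuffle_then_limit(rows, 10, random.Random(0))
```

`first_ten_reordered` always contains exactly the values 0 to 9, whereas `random_ten` is drawn from the whole list.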
[21:14:17] well, besides wanting to murder the people that wrote the monitoring infrastructure, I have nothing helpful to offer Ironholds [21:14:18] that way you avoid an exception on property access on null if wgMultimediaViewer is unset [21:14:19] sorry :( [21:14:32] milimetric, that's okay! [21:14:43] like... it's insane - it has a status page which deletes itself once the job fails [21:14:46] I'll play around with subsections of it and see which bit is causing hive to balk [21:14:46] yep! [21:15:00] ori: not sure if that's something that should be expected to happen [21:15:26] milimetric, question. can you see http://analytics1010.eqiad.wmnet:8088/taskdetails.jsp?jobid=job_1387838787660_10065&tipid=task_1387838787660_10065_m_000000 ? [21:15:38] tgr: ok [21:15:47] tgr: looks ok to me, should i merge? [21:15:49] no, taskdetails.jsp is not found [21:15:54] okay, that's interesting [21:15:59] that was a second, totally unrelated query. [21:16:09] SELECT uri_path,uri_host,uri_query FROM webrequest_text WHERE year = 2014 AND month = 04 LIMIT 1; [21:16:13] ori: go ahead [21:16:24] tgr: should i override jenkins? you're sure about the failures? [21:16:24] went to 10%, then died. [21:16:32] I'm wondering if this could be an actual issue with the system. [21:16:49] maybe there's a problem with nonstrict, one sec [21:17:06] ori: yes, I fixed that specific bug in master not so long ago [21:17:14] it's a jQuery 1.9 incompatibility [21:17:18] kk [21:17:24] it only broke the tests anyway [21:17:46] although I really don't understand how it can happen on wmf4 [21:18:38] Ironholds: this doesn't look normal, I think it's safe to assume the cluster is dead-ish [21:18:44] oh fluck. [21:18:57] Okay, I'm now blocked on...well, everything. One moment [21:19:17] https://www.youtube.com/watch?v=l1dnqKGuezo [21:19:19] there we are [21:19:52] lol [21:20:07] exactly [21:21:17] I feel like we need to have that to hand. Okay, you want me to throw a bug in bugzilla, or...? 
[21:21:26] oh, wait. the queries are working now. [21:21:27] that's...weird. [21:21:34] in one case WITH non-strict, in one case WITHOUT. [21:21:37] So the cluster is still boned! [21:21:51] oh! wait a minuteutue [21:21:51] andrew made everything go to "the one table" [21:22:06] there are still indexes in the others, at least. [21:22:34] I worried about that and so I checked select distinct(month) from webrequest_text etc etc, and there are still partitions. [21:22:39] I assumed that meant there was still stuff ;p [21:23:38] the now-running queries are running on webrequest_text, to boot. [21:25:20] ok Ironholds, so I don't know what weirdness is happening [21:25:31] BUT I do know that andrew said he moved everything to a new table called "webrequest" [21:25:35] which has "webrequest_source" [21:26:03] and webrequest_source='text' should mean the same thing as webrequest_text [21:26:22] but I definitely see errors in this that are due to the way he's set this up [21:26:27] huurm. [21:26:38] okay, I'll try running against webrequest and see what happens. [21:26:48] so Ironholds: final verdict - we borked the cluster, don't rely on it until we can talk to Andrew. And yes, make a bugzilla bug if you don't mind [21:27:23] okay. Saying...? [21:27:23] well, I just tried webrequest and it still threw an error [21:27:23] albeit a more useful one about partitions being messed up [21:27:23] java.io.FileNotFoundException: Path is not a file: /wmf/data/external/webrequest/webrequest_mobile/hourly/2014/05/15/08/08 [21:27:29] hurgh. [21:29:35] tgr: i'll do wmf5 first and ask you to verify, is that cool? [21:29:38] i'm ready to sync [21:30:50] ori: sure [21:31:08] milimetric, what exactly do I put in the BZ report? ;p [21:31:16] "queries sometimes fail and there's no job report"? 
[21:31:33] yes Ironholds, that's cool [21:31:39] and I'll add what I pasted above [21:31:41] heh [21:31:42] kk [21:33:54] tgr: done on wmf5 [21:34:00] might take up to five mins for RL cache [21:35:25] ori: looks good [21:35:34] ok, i'll go ahead with wmf4 then [21:37:30] tgr: synced [21:38:56] ori: looks good as well, thanks [21:41:29] tgr: thanks very very much for responding to this so quickly [21:42:37] we should have had a sampling parameter in the first place, even if it is set to 1 [21:42:51] that's what you get for being lazy [21:44:31] milimetric, https://bugzilla.wikimedia.org/show_bug.cgi?id=65420 [23:18:27] How do I file a bug against wikipediatrends.com? [23:19:04] nevermind, apparently that's not our site