[08:59:17] (PS1) Yurik: top 5 languages [analytics/zero-sms] - https://gerrit.wikimedia.org/r/163128
[08:59:36] (CR) Yurik: [C: 2 V: 2] top 5 languages [analytics/zero-sms] - https://gerrit.wikimedia.org/r/163128 (owner: Yurik)
[10:31:56] (CR) QChris: "> But I don't think we need to name the containing directory with generate_" (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/162589 (owner: QChris)
[14:41:27] (CR) Ottomata: ">The name" (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/162589 (owner: QChris)
[14:45:17] ottomata: I'm sorry :( it only backfilled partly, so would you mind deleting from line 114 to the end of /a/limn-public-data/mobile/datafiles/thanks-daily.csv on stat1003?
[14:45:59] all the way to the end?
[14:46:04] yep
[14:46:11] you don't want the recent data that is no 0s?
[14:46:12] not*
[14:46:22] no, it'll regenerate that
[14:46:24] ok
[14:46:33] thanks
[14:46:42] done
[14:47:55] milimetric:
[14:48:07] thanks very much
[14:48:16] Fabian is asking me if there is a database where he can export data from HDFS
[14:48:20] he could save stuff in files
[14:48:26] but he's asking about mysql or redis or something
[14:48:32] is there one he can use now?
[14:48:40] or, should I ask ops if we can set one up...maybe on stat1002?
[14:48:56] i think that would be generally useful, and there isn't one right now
[14:49:11] stat1003 has access to analytics-store and that has the staging database
[14:49:16] well, there are the research dbs, right? people do write data to tables there in the research database
[14:49:16] right?
[14:49:21] yeah
[14:49:27] 'staging' database
[14:49:29] but that's only accessible from stat1003 right?
[14:49:35] naw, you can access that from anywhere in prod
[14:49:39] oh
[14:49:42] well - that
[14:49:43] he'd need the research pw
[14:49:49] :)
[14:49:50] is that the right place for it, you think?
[14:49:51] ;)
[14:50:03] yes, that database is the right place for sure
[14:50:14] ok
[14:50:36] but it's important that we practice good citizenship and don't make a lot of tables there
[14:50:41] hm
[14:50:47] maybe I'll ask ops
[14:50:52] if we should use that or just create one on stat1002
[14:50:56] i think on stat1002 would be fine
[14:51:03] as it would be more for storage and light querying
[14:51:26] hm
[14:51:39] i don't want to speak for springle
[14:51:51] but i think his thinking with the staging database is to have all ad-hoc work there
[14:52:07] and they'd add storage as needed there
[14:52:25] i can free up a big table there if you run into space problems
[14:54:06] k, asking
[18:51:09] (PS1) Yurik: AllEnabled data aggregation, filter by wikipedia only [analytics/zero-sms] - https://gerrit.wikimedia.org/r/163220
[18:51:29] (CR) Yurik: [C: 2 V: 2] AllEnabled data aggregation, filter by wikipedia only [analytics/zero-sms] - https://gerrit.wikimedia.org/r/163220 (owner: Yurik)
[19:05:07] dammit, no qchris.
[19:05:13] ottomata, you got a sec?
[19:05:38] yup
[19:05:44] wasssup?
[19:06:01] so, I'm writing up the PV documentation. or, proposed documentation, I guess.
[19:06:11] got to the bit of filtering to only-completed requests, right?
[19:06:33] ok?
[19:06:34] so, should I just be looking for 200s? Or should I be factoring say, 302s in, too? How does SSL interact with our request logs?
[19:06:48] I appreciate this is probably a tremendously stupid question and I just don't know enough to know that ;p
[19:06:51] SSL doesn't, we get logs only from varnish
[19:06:53] so
[19:07:00] the requests we get in kafka (and hadoop) anyway
[19:07:03] only come from varnish
[19:07:09] so, SSL requests, are proxied from nginx to varnish
[19:07:26] the only difference is that the requestor IP will be in a different place
[19:07:35] in X-Forwarded-For header somewhere
[19:07:47] rather than in ip: field
[19:08:00] um, as for other HTTP response codes
[19:08:04] i don't know, i think that's up to the definition
[19:08:10] * Ironholds nods
[19:08:22] like, what about page move redirects? (those do exist, right?)
[19:08:29] do we want to count views to old page names as views?
[19:08:31] dunno!
[19:08:35] heh
[19:09:00] that's the thing. Do they? If I hit a *mediawiki* redirect, do I get a 3xx code at that redirect?
[19:09:13] dunno!
[19:09:15] you can find out easy though
[19:09:17] To my knowledge it'll just appear as one request, pointed at [[title_of_redirect]]
[19:09:19] you got an example somewhere
[19:09:21] yeah
[19:09:21] hmm, i might know one
[19:09:38] naw, I've got an obvious one
[19:09:40] [[Obama]
[19:09:41] *]
[19:09:53] cu[:~] $ curl --head https://wikitech.wikimedia.org/wiki/Analytics/Kraken
[19:09:53] HTTP/1.1 200 OK
[19:10:02] huh; interesting.
[19:10:09] So really I should...just be looking for 200s. And that's it.
[19:10:18] HTTP/1.1 200 OK
[19:10:20] for Obama
[19:10:30] you operations people and your curl mastery
[19:10:31] well, i dunno, do you want to count 400s? i think non existent pages are 400s
[19:10:47] nope, non existent pages also 200, I think
[19:10:50] [:~] $ curl --head https://en.wikipedia.org/wiki/Punderdome
[19:10:50] HTTP/1.1 404 Not Found
[19:11:01] phew
[19:11:11] oh
[19:11:12] no
[19:11:13] 404
[19:11:15] and, I don't think we want to count non-existent pages
[19:11:17] yeah, fair
[19:11:25] Ironholds: btw, there's a limited number of circumstances where you'll get a 3xx redirect
[19:11:34] awesome! In 15 seconds you've resolved an hour of headscratching
[19:11:35] YuviPanda, oh?
[19:11:35] curl -I http://en.wikipedia.org/wiki/heeeeeeeee
[19:11:41] aha
[19:11:43] Location: http://en.wikipedia.org/wiki/Heeeeeeeee
[19:11:43] probably not, but maybe you want to be all like "5 million requests to this page last week, but it doesn't exist, what gives?! someone needs to edit!"
[19:11:45] caps
[19:11:46] because of the automated capitalisation of H
[19:12:05] YuviPanda, can you think of others?
[19:12:05] OO
[19:12:17] let's throw this out to the audience! :D
[19:12:24] you also have /w/index.php?title=heeee
[19:12:33] and the entire /w/index.php?title= variants
[19:12:33] and the crucial thing: in the request logs, would that display as a 302, followed by a 200 to [[Hee]]
[19:12:35] or just a 302
[19:12:39] yeah, totally
[19:12:41] Ironholds: yes
[19:12:43] there would be 2 requests
[19:12:48] ottomata, perfect.
[19:12:57] any redirect would show as the first 3xx request, and then the redirected request later
[19:13:03] * Ironholds nods
[19:13:23] right, unless the 3xx redirect is to a page that doesn't exist, in which case it's a 3xx to a 404, but I guess you can disregard that
[19:14:00] another interesting but in this context useless response is
[19:14:06] curl -I https://en.wikipedia.org/hi
[19:14:10] returns
[19:14:10] Refresh: 5; url=https://en.wikipedia.org/wiki/hi
[19:14:16] with a code of 404
[19:14:37] makes sense
[19:14:54] wait
[19:14:55] outside of that, I can't think of any (other than action=mobileview, that is)
[19:14:55] what?
[19:15:00] * Ironholds headdesks
[19:15:07] Ironholds: ?
[19:15:14] the 404ing
[19:15:18] ah :)
[19:15:21] oh well
[19:15:28] and explain the action=mobileview problem?
[19:15:35] Ironholds: oh, apps :)
[19:15:39] and *sometimes*, mobile web
[19:15:45] although mobile web is removing that feature, I think
[19:15:47] okay. Explain that!
[19:15:50] ..in 16 seconds
[19:15:53] I've got to go to the shops
[19:15:59] Ironholds: apps use action=mobileview to get page content
[19:16:01] (or explain now and I'll read it when I get back. Shouldn't take long)
[19:16:05] you already know about page views in apps
[19:16:23] totally
[19:16:24] and mobile web, *sometimes*, when you make an edit, it'll load the page again (refresh) by requesting sections from action=mobileview, and re-render them
[19:16:29] * Ironholds nods
[19:16:33] and these use what status code?
[19:16:35] this is so that the user doesn't have to do a full page refresh after the edit completes
[19:16:42] 200
[19:16:50] but the 'ajax refresh' feature is being removed
[19:16:53] from mobile web
[19:16:59] gotcha
[19:16:59] so... you wouldn't have to worry about it too much
[19:17:15] so, with the exception of one fubar in our infrastructure, where we're redirecting but reporting 404, there are two categories of request
[19:17:22] first, successful requests. These 200.
[19:17:26] yeah
[19:17:46] this includes what we, wikimedians, understand as redirects (#REDIRECT...)
[19:17:55] second, hard redirects, normally as a consequence of specific parameters being used, or MW upper-casing the first character
[19:18:01] these appear as 302s followed by 200s
[19:18:27] taking off early everyone, shoot me an email if you need me
[19:18:29] If we just look for 200s, we'll catch both categories. If we look for 200s and 302s, we'll actually duplicate occasional requests, and not be adding any value.
[19:18:35] yup
[19:18:40] in either case, we miss out on one particular permutation of the hard-redirecting
[19:18:47] because MediaWiki hates us and wants us to hurt.
[19:18:55] but there's nothing we can do about that.
[19:19:05] Okay. so I'll write all this up and credit you two, and just look for 200s :D
[19:19:13] :)
[19:19:31] (PS9) Nuria: Bootstrapping from url. Keeping state. [analytics/dashiki] - https://gerrit.wikimedia.org/r/160685 (https://bugzilla.wikimedia.org/70887)
[19:19:32] every time I think I've plumbed the depths of "dumb stuff MediaWiki does"...
[19:19:49] my favourite is the PHP exception in our API that, in the event a request fails, reports a 200-status text/html response.
[19:19:57] I had to write in some fun caveats for /that/.
[19:20:15] "text/html and NOT api.php, orrr application/json AND api.php"
[19:20:21] ah
[19:20:22] that
[19:21:26] (CR) Nuria: Bootstrapping from url. Keeping state. (6 comments) [analytics/dashiki] - https://gerrit.wikimedia.org/r/160685 (https://bugzilla.wikimedia.org/70887) (owner: Nuria)
[19:29:24] ottomata: rebased
[20:25:08] what the heck are people doing on the hadoop cluster that's making reduce jobs so slow?
[20:25:35] (PS1) Nuria: Removing gulp-clean from build [analytics/dashiki] - https://gerrit.wikimedia.org/r/163259
[21:07:43] milimetric: yt?
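Editor's sketch (not part of the channel log): the filtering rule agreed between 19:17 and 19:20 above could look roughly like the following. It keeps only 200 responses, since a 302 is followed by its own 200 and counting both would duplicate requests, and it applies the quoted content-type caveat for api.php. The record layout and the field names `http_status`, `content_type` and `uri_path` are assumptions made for illustration; the log does not say how the request records are structured.

```python
from typing import Iterable, Iterator, Mapping


def is_counted_request(record: Mapping[str, str]) -> bool:
    """Return True if a request-log record passes the rule discussed in the log.

    The field names used here are hypothetical, chosen for this sketch.
    """
    # Only completed requests count. A hard redirect shows up as a 302
    # followed by a separate 200, so counting 302s too would double-count.
    if str(record.get("http_status", "")) != "200":
        return False

    content_type = record.get("content_type", "").lower()
    is_api = "api.php" in record.get("uri_path", "")

    # Caveat quoted in the log: "text/html and NOT api.php, or
    # application/json AND api.php". A failed API call can come back as a
    # 200 text/html response, so that combination is rejected.
    if is_api:
        return content_type.startswith("application/json")
    return content_type.startswith("text/html")


def filter_counted(records: Iterable[Mapping[str, str]]) -> Iterator[Mapping[str, str]]:
    """Lazily yield only the records that pass is_counted_request."""
    return (r for r in records if is_counted_request(r))


# Hypothetical sample records, loosely modelled on the curl examples in the log.
sample = [
    {"http_status": "200", "content_type": "text/html; charset=UTF-8", "uri_path": "/wiki/Obama"},
    {"http_status": "302", "content_type": "text/html", "uri_path": "/wiki/heeeeeeeee"},
    {"http_status": "404", "content_type": "text/html", "uri_path": "/wiki/Punderdome"},
    {"http_status": "200", "content_type": "text/html", "uri_path": "/w/api.php"},
    {"http_status": "200", "content_type": "application/json", "uri_path": "/w/api.php"},
]
kept_paths = [r["uri_path"] for r in filter_counted(sample)]
```

With the sample above, only the plain 200 page view and the JSON API response survive. The 404 that carries a Refresh header and the missed hard-redirect permutation mentioned in the log are deliberately not handled here.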
[23:38:37] Analytics / Wikistats: Identify MSIE 11 somehow - https://bugzilla.wikimedia.org/64125#c6 (Bartosz Dziewoński) I see this is not live *yet*, how often are the stats pages rebuilt?