[05:03:16] 18 04:57:45 < gry> hmm, those look nice, except I'm looking for stats for one page. Not clients on all projects, but clients that visited one page. (and where they came to it from) [05:03:20] 18 04:58:59 < jeremyb_> you'll need someone NDA'd to do the analysis for you. AFAIK [05:03:23] 18 05:00:43 < gry> I'd be somewhat inclined to get such information about once a month, but if such request requires escalating the issue, it is a bit odd [05:03:28] moved from #-tech. you can see more context there [05:03:38] 18 05:01:49 < jeremyb_> gry: i suggest you move this to #wikimedia-analytics and try again during the work day US Pacific time [05:03:42] :) [05:05:02] ... and I would be somewhat willing to get a sample of the logs and give someone a script to get the data I need, although writing such script could take a few days. If that's considered confidential or violating privacy or something, I'm fine with some URL to those. The goal is to use such data to increase usability - ideally something with useragents and referreers, so I can see where bots come onto a page from, and where humans come onto a page from [05:09:19] gry: all useragents and referrers potentially might themselves be confidential. if they are ranked somehow and they hit a threshold of popularity or otherwise whitelisted somehow then I think they can be released [05:09:44] useragents can be so unique that only 1 person in the world has one [05:09:59] yup, that's understandable [05:10:02] referers could also potentially identify a single individual [05:10:08] I'd be happy with 'bot/non-bot' separation for starters [05:10:32] and referers sorted by popularity with some kind of threshold? [05:10:35] not sure how to do this [05:11:14] but analytics has to work somehow, I see some sites rely on the currently available stats while it's not even possible to see how many of them were bots for a particular page [05:11:45] it boils down to "where do people land to this page from", probably not a complete list but at least an idea would be good [13:06:45] mornin [13:08:05] hi [13:08:24] check channel log please, I had a weird question earlier [13:08:29] bulky too [13:46:39] morning [13:47:07] you wanna demo something? [13:48:41] mooorning [13:48:46] who? me? [14:02:05] yoyo [14:02:41] do you know what's left to do to make metrics api work on stat1001 except for debianizing flask-login? [14:12:07] maybe you can have a look at card https://mingle.corp.wikimedia.org/projects/analytics/cards/319, is this clear enough for you to get started? [14:12:30] i don't know for sure, there are always new errors there as ryan develops [14:12:40] we should get him a labs instance for testing that is set up in the same way [14:12:42] could you guys look at what's starting to eat memory on oxygen? [14:12:52] i'm looking at oxygen now, ok [14:13:05] (writing an email, but not about memory) [14:13:24] thanks [14:15:05] how quickly is the locke replacement going to be in place?
[14:20:30] its there now, things just need switched over [14:20:37] and the frontend caches need the config change deployed [14:20:46] I could probably get that done today, at least for the non FR udp2log stuff [14:21:37] robla, there's lots of memory free on oxygen, udp2log processes are not using much [14:21:41] puppet is using more than any other process [14:22:28] ottomata: it doesn't look like we're having packet loss yet, so it's not too surprising [14:22:33] i think we're waiting to catch the packet loss [14:22:33] right [14:36:16] ottomata: looks like the packet loss on oxygen is starting to ramp up [14:38:42] k checking, thanks [14:39:06] ottomata, thanks for the error message from the stat1 cron job. I'll ping notpeter as soon as he's on [14:43:15] cool, mk [14:50:05] milmetric, did you get a chance on friday to graph zero data? :D [14:54:03] [travis-ci] master/3fca6e7 (#96 by Andrew Otto): The build has errored. http://travis-ci.org/wikimedia/kraken/builds/5596168 [15:00:51] hey otto, stop misspelling my name :) [15:01:06] ottomata: I didn't get a chance to do zero on Friday [15:01:17] ok [15:01:18] I was wrapped up in the stat1 mayhem [15:01:25] aye, what happened? :) [15:01:28] and working with notpeter to puppetize [15:01:44] aye [15:06:39] gry: I read through the logs and I've no clue what you're referring to [15:06:45] sorry :) [15:21:18] erosen, ping me when you're around [15:22:36] milimetric: just heading out ot a conference, will be back in 45 at the soonest [15:22:44] milimetric: anything urgent? [15:22:52] cool, nope, just wanted to talk about zero [15:22:56] it can def. wait [15:23:01] k, will ping you when I'm back [15:23:03] latersss [16:28:20] ok, milimetric, drdee, blog email should be fixed [16:28:36] oh cool, what was wrong? [16:28:43] many things :/ [16:29:30] the variable is $passwords::mysql::research::pass, the passwords::mysql::research class was not included, the command had a newline in it (but that might have been my fault) [16:29:35] would love a rundown when you have time 'cause I agree with drdee: we should all have some level of proficiency in puppet [16:29:35] ah [16:29:55] so [16:29:56] this [16:29:57] $passwords::mysql::research::pass [16:30:06] is defined in a class called passwords::mysql::research [16:30:12] like [16:30:14] $pass = '....' [16:30:26] so you can use that variable from that class by fully qualifying it [16:30:35] but you have to make sure that the class is included on that node [16:31:20] ok, i'm going to run home and stop by the grocery store, I need some kale [16:31:23] cool [16:31:31] be back for standup [16:31:34] latas [16:38:25] I agree with milimetric [16:38:30] I lack puppetfu [16:38:34] and pigfu [16:38:42] yes [16:38:49] among other things [16:39:03] and now that we know we should always puppetize 100% of our work, it's very important [16:39:49] milimetric: is it official with the 100% puppetization ? [16:40:14] well, anything that we do on the production cluster should be 100% puppetized at the request of Leslie [16:40:19] and it makes good sense to me [16:40:56] I'm reading YuviPanda [16:42:10] hm, I wonder if deploy.only_data will cause caching issues [16:42:38] YuviPanda: next time you have a problem with mobile_dev, would you mind doing a fab mobile_dev deploy [16:42:44] and let me know if the issue goes away? [16:43:13] milimetric: I just did one. [16:43:30] milimetric: the problem is that it is cached - i still see old data for a fair amount of time [16:43:30] deploy.only_data or deploy? 
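[A minimal sketch of the class-variable pattern ottomata walks through at 16:29-16:31 above; the node block and placeholder value are hypothetical, since the real class lives in the private puppet repo:]

    # passwords.pp -- the variable is defined inside a class
    class passwords::mysql::research {
        $pass = '...'   # actual secret kept in the private repo
    }

    # site.pp -- hypothetical node definition: the class must be included
    # on the node before its variables resolve
    node 'stat1001' {
        include passwords::mysql::research

        # fully qualified reference, as in the blog-email cron fix
        $research_pass = $passwords::mysql::research::pass
    }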
[16:43:30] aah [16:43:32] only_data [16:43:32] right [16:43:34] missed that [16:43:42] i'm doing only_data [16:43:53] yep, and that should work [16:43:53] and I've to use incognito to see the changes [16:44:02] milimetric: so, just deploy should work? [16:44:04] but it should be simple to try just plain deploy [16:44:08] that'll restart the server [16:44:11] without any unncecssary side effects? :) [16:44:12] got it! [16:44:23] and tell me that the caching problem is between those two different deployments [16:44:29] yeah, it'll only affect your instance [16:45:07] ok [16:58:55] drdee: missing scrum today; at PyData [16:59:06] aight [16:59:46] scrum! [16:59:57] https://plus.google.com/hangouts/_/2da993a9acec7936399e9d78d13bf7ec0c0afdbc [17:01:58] milimetric ^^ [17:02:34] yep, sec, tech probs [17:03:14] kraigparkinson: https://plus.google.com/hangouts/_/2da993a9acec7936399e9d78d13bf7ec0c0afdbc [17:04:06] omw [17:16:30] erosen: you're at pydata? [17:16:36] yups [17:16:47] listening to a talk on hdf5 [17:18:32] coolio [17:19:28] kraigparkinson: was there a specific time/venue where we wanted to talk about packetloss? [17:20:03] didn't schedule one, but was hoping you would have started the conversation during the hangout... [17:20:08] We did. [17:20:18] All of it is captured in the thread, I think. [17:20:36] btw, thx to you all naming oxygen "oxygen", I have this song stuck in my head: http://www.youtube.com/watch?v=zmlKjO4juCo [17:20:56] haha [17:20:56] robla, were you happy with the suggestions mentioned in the thread? [17:21:21] kraigparkinson: yeah, I think so. just needs to happen soonish [17:21:43] lol, what's soonish? timeframe please. :) [17:22:19] robla, I want to make sure we can assign an appropriate class of service. [17:22:22] dschoon & drdee, do we have infra tasks captured for those ideas in mingle so we can track? [17:23:42] kraigparkinson: can we quickly chat about this so i can update you ? [17:23:52] drdee, sure. [17:24:39] https://plus.google.com/hangouts/_/2da993a9acec7936399e9d78d13bf7ec0c0afdbc [17:31:34] YuviPanda: I wanted to help finish up the dashboards you've been working on [17:31:42] hey erosen, are you also going to pydata tomorrow? (and if so, how are you getting there? :P) [17:31:45] I saw some stuff in the logs about Ori helping with rsync [17:32:00] ori-l: am indeed [17:32:04] ori-l: I'm actually there now [17:32:13] ori-l: I live in Palo Alto, so I'll be driving [17:32:13] milimetric: yes, there's a gerrit changeset [17:32:24] and it hasn't been approved? [17:32:32] ori-l: you're more than welcome to catch a caltrain to palo alto and we can carpool from there [17:32:52] erosen: i may take you up on that! [17:33:18] ori-l: just let me know sometime today. where are you coming from again? [17:33:22] milimetric: it's https://gerrit.wikimedia.org/r/#/c/54116/ , and otto made a valid point, i just need to update the patch [17:33:50] erosen: north beach, bro. [17:34:03] * YuviPanda was at north beach with Ryan Faulkner [17:34:22] ori-l: I see, well a bullet caltrain should work well (you might want to coordinate with rfaulkner, too) [17:34:48] yes, i'll ask him today [17:34:50] thanks ori-l. YuviPanda, is there anything else that needs to happen after that rsync is set up? The files will just go to stat1001 and your dashboards have to be updated to pull from there, correct? [17:35:01] (the 'bro' was neighborhood self-deprecation :)) [17:35:23] milimetric: yeah, pretty much. 
I'll need a cron on stat1 too, I think [17:35:37] ori-l, erosen: already here also, I'm in "HDF5 is for lovers" ;) [17:35:40] and we'll need to see how it works with the cache, since I guess tehre won't be any fab deploys. [17:35:44] * YuviPanda waves at rfaulkner [17:35:46] erosen I think I see you [17:35:47] (ori-l: i suspected) [17:35:48] (didn't notice) [17:36:02] rfaulkner: i didn't make it today; when is your talk? [17:36:08] wednesday [17:36:09] YuviPanda: you mean besides this cron: https://gerrit.wikimedia.org/r/#/c/54116/4/manifests/misc/statistics.pp ? [17:36:13] rfaulkner: i see you now [17:36:15] hehe [17:36:49] milimetric: well, that just rsyncs. how do we *generate* them [17:36:50] ori-l: I drove down, I guess I'll just torture myself each day in that way [17:36:52] yes, I'm looking into the caching now [17:37:07] oh ok, so a cron to run your scripts [17:37:13] rfaulkner: you are really welcome to caltrain and then drive from PA [17:37:21] there are bullets form sf to pa [17:37:28] rfaulkner: would love to catch a ride with you if possible -- ! either by riding along or catching the train with you [17:37:32] i can help you by puppetizing that if you tell me what needs to run [17:37:40] and I live quite close to the caltrain station, so it is not out of the way [17:37:47] ori-l: defintely [17:38:42] and everyone: general question: how do you test changes to the puppet repo? <-- ottomata, ori-l [17:39:48] YuviPanda: it seems to me like the command from that rsync cron job should just be the last line of a script that does everything else too. These things would never run separately right? [17:40:20] hmm? I didn't get that. [17:40:40] that the rsync cron job should actually point to a script, so that it is 'generate + rsync' [17:40:41] ? [17:41:11] milimetric: you can use labs; i use vagrant [17:42:14] yes YuviPanda, that's what I meant [17:42:41] thanks ori-l [17:43:56] * YuviPanda should learn puppet [17:50:08] milimetric: can you leave a comment on the gerrit changeset then? [17:50:46] YuviPanda: well, i can do better than that. I can submit a new changeset once you let me know what needs to be in that script [17:51:03] milimetric: ah, right :) okay [17:51:15] milimetric: I can put a script in the repo for you to run. [17:51:34] add a new template in the templates/misc directory [17:51:59] I will make sure puppet puts it in the right spot and executes it with that cron job. [17:52:00] or wait [17:52:09] you can just send me the script :) [17:52:34] milimetric: i haven't written it yet :) [17:52:42] or... you can do the patchset yourself and I can try to teach you the TINY amount of puppet I know [17:52:57] but i'll need to test it, and my macbook air will die with the vm :( [17:53:04] also meeting time, brb [17:53:07] i'm sure ottomata will help out once we have something rather than nothing [17:53:11] k [18:14:40] erosen: !!!!! [18:14:42] erosen: http://pig.apache.org/docs/r0.11.0/api/org/apache/pig/piggybank/storage/MultiStorage.html [18:14:43] HA [18:14:46] BAM [18:14:49] hehe [18:14:59] The UDF is useful for splitting the output data into a bunch of directories and files dynamically based on user specified key field in the output tuple [18:15:00] in the piggybank no less [18:15:03] yup [18:15:03] indeed. [18:15:04] awesome [18:15:04] built-in [18:15:21] we were totally in the right by thinking, "there is no way we are the first people with this problem" [18:38:26] dschoon, rfaulkner, python question [18:38:36] sup? 
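[For reference, a minimal sketch of the piggybank MultiStorage UDF ottomata and erosen found at 18:14; the input path and schema are hypothetical:]

    REGISTER 'piggybank.jar';

    logs = LOAD '/wmf/logs/sampled' USING PigStorage('\t')
           AS (carrier:chararray, url:chararray);

    -- writes one subdirectory (and file) per distinct value of field 0,
    -- i.e. per carrier
    STORE logs INTO '/wmf/out/by_carrier'
        USING org.apache.pig.piggybank.storage.MultiStorage('/wmf/out/by_carrier', '0');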
[18:38:38] python modules, to install them, they really just need to be put in a directory, right? [18:38:56] /usr/lib/pymodules/python2.7, or something? [18:39:17] depends on what you mean by "install" [18:39:37] pip/easy_install does a bunch of a things [18:39:47] it creates .egginfo files and puts them in certain places [18:40:00] it sometimes updates a .path file in the site-root [18:40:18] it also uses well known paths to link together the directories [18:40:22] so ... not always [18:40:34] also, i think pymodules is for binary modules [18:40:40] like, C extensions [18:53:35] ^^ ottomata [18:55:34] i'm thinking of what a .deb would have to do [18:57:25] dschoon ^ [18:57:37] like [18:57:45] could I easy_install or pip install w virtual env [18:58:06] and then debianize by copying the relevant directories to the proper locations? [18:58:37] or maybe somehting like this? [18:58:37] https://pypi.python.org/pypi/stdeb [18:58:39] ... [18:58:41] i have no idea. [18:58:55] this is why i advocate 1. not installing things using .deb [18:59:02] 2. using local dir installs when possible [18:59:25] but i guess the problem here is that we don't want to install from the internet [19:00:14] so setting up our own pypi via https://pypi.python.org/pypi/pypiserver/1.1.0 would make this the cleanest [19:00:17] ^^ ottomata [19:00:21] what about pip install pip install -e git+git@github.com.... [19:00:30] again, from the internet :P [19:00:42] i meant from gerrit [19:01:44] naw, needs to be a deb, really [19:02:31] how about this [19:02:46] pack the source and the source of the deps into the deb [19:03:08] use virtual_env and pip to put in the right spot via deb? [19:03:11] ottomata: ^^ [19:03:57] dschoon: i like it [19:04:02] yeah that's what I was thikning, debian/rules could use virtual_env and pip to put the files into their proper directories inside of debian/ [19:04:09] no no [19:04:10] but this looks pretty good too [19:04:10] https://pypi.python.org/pypi/stdeb [19:04:35] i meant running pip inside ve at *install* time [19:04:38] oh have the .deb run virtual_env and poip commands [19:04:39] haha [19:04:39] hmmm [19:04:51] because i don't think you can freeze anything meaningful afterward [19:04:59] right, ha [19:04:59] hm [19:05:06] i dunno, i probably couldn't get away with that [19:05:14] install location is dependent on a thousand things [19:05:28] i'd also recommend you look at what the existing python-* packages do [19:05:34] the ones in apt [19:05:55] i predict this is going to be a lot of effort for no gain, though [19:05:55] python setup.py --command-packages=stdeb.command debianize [19:06:17] no idea. [19:07:53] like, `aptitude search python-sqlalchemy` [19:08:04] that's just the sqlalchemy package, redone for apt [19:08:09] did they use stdeb? [19:08:12] i have no idea. [19:08:17] ^^ ottomata [19:08:19] brb lunch [19:18:10] another run for november and december started on stat1 [19:18:27] with the same charts that we I last discussed with Evan [19:18:41] I am confident I have transferred sufficient logic from kraken to solve the bump [19:18:45] we'll find out in a couple of hours [19:18:52] yay [19:19:15] I've isolated days around the gap 10 december => 18 december (the gap is on 14) and ran on those, and the gap disappeared [19:19:29] niiice [19:19:41] so do you know what the problem feature is? 
[19:20:45] erosen: I am led to believe that we can attribute the 500M bump to the dash mimetype in december [19:20:49] http://stat1.wikimedia.org/spetrea/embr_py_mobile_pageviews/r3-resized-charts/chart_mime_type=-.png [19:21:21] the logic I ported from Kraken is the following https://github.com/wikimedia/metrics/commit/364bfb25b572558698bba51fb7d221511284f6b6 [19:23:55] stdeb seemed to work pretty good~!!!! [19:24:35] awesome, ottomata [19:24:36] ottomata: oh cool :) [19:33:24] yeah, ok, if I can do what I just did AND use git-buildpackage at the same time [19:33:27] I think I can do this [19:35:20] ottomata: why do you need git-buildpackage if stdeb already makes the deb for you? [19:35:36] faidon wants us to use that as the standard format [19:35:48] ok [19:35:53] mainly, in this case, git-buildpackage will just enforce a branching structure [19:36:04] i'll use stdeb to create the debian/ files [19:36:12] cool [19:36:13] and then git-buildpackage to build the .deb [19:36:29] nice, I learned about stdeb today, didn't know about it [19:36:45] yeah, i just googeld and found that [19:38:26] I'm out to get some stuff, bb in 30m [19:39:11] average_drifter can you give me a sample of log lines with mime type = '-' [19:39:18] a couple of 1000 lines is enough [19:42:25] hey guys, brb 1 hour, gotta go pick up baby chickens [19:45:14] average_drifter ^^ [19:46:38] baby chickens? [19:48:36] kraigparkinson: are you still editing mingle card 367? [19:48:54] yep, will be done soon. [19:50:35] done for now. [19:51:21] just copying in some context from the email thread into analytics/#367. please add/delete what you think will be helpful. I want a coherent picture (non-threaded) of what's happening. [20:34:35] ottomata: can you have a look at https://gerrit.wikimedia.org/r/#/c/53624/ [20:35:28] gotcha [20:37:30] erosen: you about? [20:37:36] dschoon: yup [20:37:40] was just working on a response [20:37:42] but this is justa s good [20:37:48] question about zero stuff [20:37:51] yeah [20:37:56] were you gonna ask about that? [20:38:00] yeah. [20:38:03] k [20:38:09] so i was not comparing against kraken [20:38:18] ah. [20:38:29] but instead, just looking at what was captured with udpfilter [20:38:38] i checked how often the x-cs header was set correctly [20:38:49] i actually was going to ask about our new kraken script, but ok [20:38:53] go ahead :) [20:38:55] ah [20:38:59] in theory, every mobile requests in the 1:10 sampled stream should have the x-cs header set [20:39:02] hehe [20:39:08] ok, well that's all I was gonna say [20:39:11] i didn't want to assume that because it looked like x-cs logging was disabled in puppet? [20:39:19] so go ahead with the other q [20:39:39] hokay. [20:39:49] hrm.. it definitely shows up a fair amount in the zero logs on oxygen [20:39:50] output format for both the by-carrier and by-geo files is: [20:39:50] COUNT, DATE, LANGUAGE, SITE, VERSION, COUNTRY, PARTNER NAME [20:39:57] hold that thought. [20:40:13] for the by-carrier files, that count is the count of... what? [20:40:48] the number of page requests which fall in that bin [20:40:51] COUNT( DISTINCT(date, carrier, version) )? [20:40:53] does that make sense? [20:40:57] does country count? [20:41:03] for *by carrier* [20:41:03] yeah [20:41:07] i aggregate it later [20:41:08] okay. [20:41:11] so we could leave it out [20:41:17] but it was nice to have everything in the same format [20:41:21] okay. 
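[A sketch of the grouping erosen describes, producing rows in the stated COUNT, DATE, LANGUAGE, SITE, VERSION, COUNTRY, PARTNER format; the relation and field names are hypothetical:]

    by_carrier = FOREACH (GROUP log_fields BY (date, language, site, version, country, carrier))
        GENERATE COUNT(log_fields) AS count, FLATTEN(group);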
[20:41:23] and it will be useful for doing forensics [20:42:04] i mean, it's not that big a difference to the script [20:42:05] btw, a thing i think we didn't realize about pig [20:42:05] COUNT takes an *expression* [20:42:31] carrier_count = FOREACH (GROUP log_fields BY (day_hour, carrier)) [20:42:32] GENERATE [20:42:32] FLATTEN(group), [20:42:33] COUNT(log_fields.carrier) AS total, [20:42:35] COUNT(log_fields.hostname MATCHES '\\.m\\.') AS M, [20:42:37] COUNT(log_fields.hostname MATCHES '\\.zero\\.') AS Z, [20:42:39] COUNT(NOT (log_fields.hostname MATCHES '\\.(m|zero)\\.')) AS X [20:42:41] ; [20:42:59] interesting [20:43:07] didn't know that either [20:43:08] very nice [20:43:18] back [20:43:34] but anyway, you're saying that i need to be grouping on (day_hour, carrier, country) [20:44:26] and language, site and version … [20:44:29] right? [20:44:48] really. [20:44:52] okay! [20:44:53] uh [20:45:07] how much time do you have? [20:45:27] like right now [20:45:28] i kinda want to walk through it one more time (from the top) to make sure we have it right [20:45:28] ? [20:45:30] yeah [20:45:32] i can chat [20:45:38] yeah, want to video chat? [20:45:39] you in a thing? no hangout? [20:45:41] yeah. [20:45:46] i'm in a not-that-useful session atm [20:46:02] i'll step into the lobby [20:46:05] one sec [20:46:23] k [20:46:42] erosen: https://plus.google.com/hangouts/_/cb6d9578920149d8fabc14c6a243bd1498dfca25 [20:52:03] dschoon: can you hear me? [20:52:10] couldn't for a minute [20:58:14] dschoon: can't hear [20:58:30] yes [21:21:58] ottomata: can you update card https://mingle.corp.wikimedia.org/projects/analytics/cards/319 with the things you have done and what's left to be done? [21:22:20] yeah one sec [21:22:33] will do before I quit today [21:23:46] drdee: scp diederik@stat1.wikimedia.org:/home/spetrea/sample_dash_mimetype_16_dec_2012.gz . [21:23:57] ty [21:25:15] drdee: you're welcome [21:25:23] many jpg seem to have dash mimetype [21:25:25] which is weird [21:25:44] interesting [21:26:29] gry: I read through the logs and I've no clue what you're referring to [21:26:57] milimetric: https://pastee.org/397zb [21:27:31] thanks gry [21:27:35] looking [21:31:10] gry: we have multiple types of bots [21:31:15] gry: search engine bots [21:31:35] gry: wikimedia bots => pywikimediabot [21:31:45] gry: and custom bots people may write [21:31:51] gry: which are you interested in? [21:32:42] I'm interested in the data described: list most popular referers for human visits on specific page [21:33:15] and same for bots [21:33:35] so you want to know where bots and human traffic comes from [21:33:38] for bots optional, I'm mainly interested in humans; I don't see how this is doable with a bot which accesses mediawiki api [21:33:40] yes [21:34:17] for a specific page, not for an entire project. and not total traffic, it -could- be usefully filtered out so we only watch human traffic [21:34:31] dartar, you there? [21:34:49] if it's a nice bot (for example pywikimediabot) it will tell us it's a bot through its UserAgent I imagine [21:35:16] doesn't pywikimediabot only access mediawiki api? [21:35:24] ah yes, it would tell, yes [21:35:28] :) [21:35:40] some bots wouldn't, it'd be an empty useragent or so [21:35:46] we could whitelist few browsers [21:36:09] gry: please discuss with drdee so we can see if and when we have room for your request [21:36:34] gry: also discuss with kraigparkinson [21:36:43] gry: can you send me an email with your request? [21:36:58] plus context and reasons? 
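[Coming back to gry's request: a minimal sketch of the per-page referrer breakdown with the popularity threshold jeremyb_ suggested, so that rare (potentially identifying) referrers are never released; the column positions and the bot heuristic are hypothetical:]

    from collections import Counter

    THRESHOLD = 100  # never release referrers seen fewer times than this

    def referrer_report(log_lines, page_url):
        humans, bots = Counter(), Counter()
        for line in log_lines:
            fields = line.split('\t')
            url, referrer, user_agent = fields[8], fields[11], fields[13]
            if url != page_url:
                continue
            # crude heuristic: well-behaved bots (e.g. pywikipediabot)
            # identify themselves in the user agent
            bucket = bots if 'bot' in user_agent.lower() else humans
            bucket[referrer] += 1
        keep = lambda c: {r: n for r, n in c.items() if n >= THRESHOLD}
        return keep(humans), keep(bots)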
[21:37:05] kraigparkinson: do we have a google hangout link? [21:37:24] drdee, don't think he's set one up, but I'll do it and invite him and the others... [21:37:30] k [21:38:06] drdee: which email? [21:38:27] dvanliere at wikimedia dot org [21:43:35] drdee, kraigparkinson: sorry, back to my keyboard [21:43:57] thanks. [21:45:26] drdee, kraigparkinson: added a hangout to the meeting [21:49:59] drdee: email sent - can you confirm you got it please? [21:50:55] got it [22:06:24] we're good [22:06:25] no more gap [22:06:26] http://stat1.wikimedia.org/spetrea/embr_py_mobile_pageviews/r4-kraken-partial-logic-implemented/chart_lang=en.png [22:06:29] or bump [22:07:21] the periodic weekly pattern shows in december too [22:07:38] average_drifter: what did it? [22:07:40] there's some symmetry in that chart, and symmetry's always good :D [22:07:47] erosen: just ignored mimetype dash [22:07:54] crazy [22:08:11] erosen: I agree [22:08:12] any theories on why we would do that? [22:08:34] average_drifter: as in, why was there a huge bump? [22:08:35] erosen: drdee mentioned that something happened with mimetypes being invalid [22:08:51] erosen: and his theory is confirmed by the fact that there are many jpg labelled with mimetype dash in december [22:09:07] erosen: scp erosen@stat1.wikimedia.org:/home/spetrea/sample_dash_mimetype_16_dec_2012.gz . [22:09:10] erosen: for proof [22:10:14] erosen: you will see there(in the .gz above) a lot of lines which show jpgs but they're labelled with mimetype dash instead of image/jpeg [22:10:27] average_drifter: but I'm confused why those wouldn't have been caught by the mimetyp='text/html' filter [22:12:15] erosen: https://github.com/wikimedia/metrics/commit/364bfb25b572558698bba51fb7d221511284f6b6#L1R442 [22:12:25] erosen: so I guess you're right, it's not hte mimetype [22:12:56] erosen: it's either the status code or the request method [22:13:57] average_drifter: let me get this straight: the script you just linked producing the bump right now? or __not__producing the bump? [22:14:18] @cache [22:14:19] def init_request(self): [22:14:19] """returns whether the log line represents a text/html request with status code < 300 whose url path starts with /wiki/""" [22:14:21] return self.mime_type() == 'text/html' and self.status_code() < 300 and self.url_path() and self.url_path()[0] == 'wiki' [22:14:24] @cache [22:14:27] def old_init_request(self): [22:14:29] """returns whether the log line represents a request to an article with url format /wiki/""" [22:14:35] <average_drifter> return self.url_path() and self.url_path()[0] == 'wiki' [22:14:41] <average_drifter> erosen: ^^ you used the old_init_request, so all it was checking is that the first part of url_path part was "wiki" [22:15:03] <erosen> i see [22:15:04] <erosen> good point [22:15:12] <erosen> so perhaps init_request would work [22:15:52] <average_drifter> erosen: probably [22:16:08] <average_drifter> erosen: should we pinpoint the cause? [22:16:11] <average_drifter> I guess we should [22:16:22] <average_drifter> I should make another run and find out exactly what caused it then [22:16:26] <erosen> eeh, would be nice to know [22:16:45] <erosen> average_drifter: so, to be clear, do you have a python script that works? [22:17:18] <erosen> (i.e. no bump?) [22:19:30] <dschoon> kraigparkinson: you in the office? or still wfh? [22:21:14] <kraigparkinson> dschoon, still wfh [22:21:22] <dschoon> kk, so hangout then? 
[22:23:55] <dschoon> brb [22:24:15] <average_drifter> erosen: yes, I have a python script that works [22:24:33] <average_drifter> erosen: using your code(slightly modified) [22:24:39] <erosen> average_drifter: so it's just a matter of removing filters until the bump appears right? [22:24:43] <average_drifter> yes [22:24:46] <erosen> average_drifter: actually maybe we're getting confused [22:24:52] <erosen> it seems like the mimetype is enough [22:25:04] <erosen> or i guess we haven't proved it [22:25:13] <average_drifter> we haven't proved that it's the mimetype [22:25:19] <erosen> yeah [22:26:40] <average_drifter> erosen: the only problem is that removing one filter and rerunning takes like 3h30m [22:26:45] <erosen> hehe [22:26:57] <erosen> can't we rerun it on the sampeld file? [22:27:14] <average_drifter> erosen: and in the worst case it would take 3h30m * 7 to find out in the worst case [22:27:34] <average_drifter> we can do that also yes [22:28:15] <erosen> it's not a huge priority for me [22:28:21] <dschoon> back in 15 -- gonna finish out from home so i can use my 30" [22:28:43] <average_drifter> what would drdee do? [22:28:50] <average_drifter> drdee: what would you do? [22:29:23] <drdee> do about what? [22:29:50] <average_drifter> drdee: suppose you had 7 filters, and the problem is fixed, and you want to find out which of the filters caused the bump [22:29:59] <average_drifter> drdee: and a run costs you 3h30m [22:30:22] <average_drifter> uhm ok I guess sampling [22:30:38] <drdee> yes just run it against 1 day of data [22:31:13] <average_drifter> ok [22:44:07] <ottomata> drdee, kraigparkinson, updated 319 [22:44:11] <ottomata> flask-login .deb built, added to apt.wikimedia.org, and installed on stat1001! Yay! [22:44:13] <ottomata> yayayay! [22:44:16] <ottomata> see yas tomorrow [22:44:18] <ottomata> over and out [22:44:22] <kraigparkinson> gracias! :) [22:44:35] <YuviPanda> milimetric: too late :( I'll finish the script and mail it to you tomorrow [22:44:36] <YuviPanda> sorry [22:44:45] <DarTar> ottomata: thanks, just talking to ryan about it :) [22:44:49] <YuviPanda> neck deep in some Java memory issue debugigng [22:48:44] <erosen> ottomata: how did it go down? the debianization that is? [22:49:51] <ottomata> took a bit of back and forth [22:49:54] <ottomata> stdeb wasn't a good idea [22:50:02] <ottomata> http://wiki.debian.org/Python/TransitionToDHPython2 [22:50:06] <ottomata> we used that [22:50:33] <ottomata> https://gerrit.wikimedia.org/r/gitweb?p=operations%2Fdebs%2Fpython-flask-login.git;a=summary [22:51:04] <erosen> i see… so the final version used what method? [22:51:37] <erosen> coool [22:52:41] <erosen> just curious for when this comes up in the future [22:53:58] <ottomata> i also used git-buildpackage [22:54:07] <ottomata> http://honk.sigxcpu.org/projects/git-buildpackage/manual-html/gbp.html [22:54:07] <erosen> cool [22:56:34] <dschoon> back [23:38:04] <kraigparkinson> wb dschoon
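[A sketch of the filter bisection average_drifter and drdee settle on above: rerun against one sampled day with each predicate disabled in turn, instead of 3h30m full runs; predicate names mirror init_request and the log-line API is hypothetical:]

    FILTERS = {
        'mime_type':   lambda l: l.mime_type() == 'text/html',
        'status_code': lambda l: l.status_code() < 300,
        'url_path':    lambda l: bool(l.url_path()) and l.url_path()[0] == 'wiki',
    }

    def count_without(lines, skipped):
        """Count lines of one sampled day passing every filter except `skipped`."""
        return sum(1 for l in lines
                   if all(f(l) for name, f in FILTERS.items() if name != skipped))

    # the filter whose removal brings the December bump back is the culprit:
    # counts = {name: count_without(day_sample, name) for name in FILTERS}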