[01:01:59] halfak, I needed a bit of hand holding with installing python packages
[01:02:51] So on stat1002 do I need to use puppet to install python packages?
[01:03:08] I'm already using virtualenv and want to use the mediawiki-utilities package
[01:04:12] Also there is no internet connection, so am I right in assuming that I'll have to scp the source and install it?
[16:10:23] halfak, are you around?
[17:05:26] o/ ashwinpp
[17:05:34] I'm going to be on and off for a bit.
[17:05:44] if you send me a question, I'll answer when I can.
[17:05:54] do I have to go through puppet to install on stat1002?
[17:06:12] ok
[17:06:16] [17:01] halfak, I needed a bit of hand holding with installing python packages [17:02] So on stat1002 do I need to use puppet to install python packages? [17:03] I'm already using virtualenv and want to use the mediawiki-utilities package [17:04] Also there is no internet connection, so am I right in assuming that I'll have to scp the source and install it?
[17:06:29] oops, I hope it's readable
[17:07:27] Ahh. I see.
[17:07:39] So, you need to set an environment variable and it will just work.
[17:07:41] One sec.
[17:07:59] export http_proxy=http://webproxy.eqiad.wmnet:8080
[17:07:59] export https_proxy=http://webproxy.eqiad.wmnet:8080
[17:08:07] Do that in your .bashrc
[17:08:29] It's sad to use http for an https proxy, but that's what they would like us to do for requests beyond the firewall.
[17:08:50] Then pip will work as expected.
[17:08:59] Also, note that mediawiki-utilities is python 3.x only.
[17:10:47] yes I noticed the 3.x requirement
[17:11:00] I had set up my virtualenv with 3.4
[17:11:06] thanks :)
[17:11:27] :) no problem
[18:38:02] halfak, another tiny question. From tools-login I can access the database replicas. Can I access something similar from stat1002?
[20:02:54] halfak, another clarification. I looked at your code and each processor handles a different xml dump file, is that correct?
[22:52:42] ashwinpp, yes on the DB and yes on the xml processing.
[22:52:55] * halfak digs for docs on getting at the analytics DBs.
[22:56:34] bah. These are out of date.
[22:56:36] * halfak updates
[22:59:19] Here you go https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Analytics_slaves
[23:03:57] aaargh
[23:04:01] * Ironholds cries
[23:05:57] halfak, good news!
[23:06:05] we have massive duplication and triplication in the webrequests table
[23:06:40] Boo.
[23:06:46] Can't triple with a filter.
[23:06:49] Must happen elsewhere.
[23:07:01] Maybe the job that loads in the data is duplicating records.
[23:07:38] yeah, I thought we had an arithmetic series calculation in production for identifying precisely this problem
[23:07:44] sending out an email with lots of very pretty graphs
[23:08:05] Nice work catching it.
[23:30:32] halfak, actually I'm hoping I just don't understand sequence numbers, because the alternative is kinda scary :D
[23:30:44] because, note that the number of duplicates doesn't actually go to 0 after the decrease.
[23:30:52] so... I don't know what that means for how many pageviews we actually have.
[23:31:22] Na. If we are getting dupes and the sequence numbers betray them, then we can dedupe on ETL or even earlier.
[23:31:36] I think that kafka might only guarantee at-least-once delivery.
[23:31:45] So you could always get more than one.
[23:32:30] yeah, s'true
[23:32:36] like, it's not an unworkable problem.
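A quick way to confirm the proxy exports from [17:07:59] are actually being picked up, before blaming pip: Python's urllib builds its default opener from the http_proxy/https_proxy environment variables, so a fetch that succeeds here means pip should get through the firewall too. A minimal sketch; the URL is just an example.

```python
import os
import urllib.request

# The exports halfak gave for the stat1002 .bashrc; set here as a
# fallback in case the shell profile hasn't been sourced yet.
os.environ.setdefault("http_proxy", "http://webproxy.eqiad.wmnet:8080")
os.environ.setdefault("https_proxy", "http://webproxy.eqiad.wmnet:8080")

# urllib.request reads proxies from the environment by default
# (urllib.request.getproxies), so this request goes via the web proxy.
with urllib.request.urlopen("https://pypi.org/simple/", timeout=30) as resp:
    print(resp.status)  # 200 means the proxy works; pip should too
```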
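The one-processor-per-dump-file layout confirmed at [22:52:42] is easy to reproduce with the standard library. A sketch with a toy worker standing in for mediawiki-utilities' real dump parser; the glob path is a placeholder.

```python
import bz2
import glob
from multiprocessing import Pool

def count_revisions(path):
    """Toy per-file worker: tally <revision> open tags in one dump.

    A real worker would hand the file to mediawiki-utilities' XML dump
    parser; this stand-in only shows the one-process-per-file layout.
    """
    revisions = 0
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            revisions += line.count("<revision>")
    return path, revisions

if __name__ == "__main__":
    paths = glob.glob("/path/to/dumps/*.xml*.bz2")  # placeholder location
    with Pool(processes=8) as pool:
        # Each dump file is handled start-to-finish by a single process.
        for path, revisions in pool.imap_unordered(count_revisions, paths):
            print(path, revisions)
```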
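For the analytics-slaves question at [18:38:02], the wikitech page halfak links at [22:59:19] has the authoritative connection details; access from stat1002 amounts to an ordinary MySQL client connection. A rough sketch only: the host name and credentials file below are assumptions to be checked against that page.

```python
import pymysql  # or any MySQL client library

# Host and defaults file are assumptions -- take the real values from
# https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Analytics_slaves
conn = pymysql.connect(
    host="analytics-store.eqiad.wmnet",
    read_default_file="/etc/mysql/conf.d/research-client.cnf",
    database="enwiki",
    charset="utf8",
)
try:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT COUNT(*) FROM revision WHERE rev_timestamp >= %s",
            ("20150101000000",),
        )
        print(cur.fetchone()[0])
finally:
    conn.close()
```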
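halfak's point at [23:31:22] is that under at-least-once delivery duplicates are expected rather than alarming, and per-host sequence numbers make them cheap to drop during ETL. A minimal sketch; the hostname/sequence field names are assumptions taken from the discussion.

```python
def dedupe(records):
    """Keep only the first delivery of each (hostname, sequence) pair.

    Kafka's at-least-once guarantee means the same webrequest record can
    arrive more than once, but each (hostname, sequence) pair should be
    unique, so repeats are safe to discard. Field names are assumptions.
    """
    seen = set()
    for record in records:
        key = (record["hostname"], record["sequence"])
        if key not in seen:
            seen.add(key)
            yield record
```

In a real ETL job the `seen` set would need bounding (say, a sliding window over recent sequence numbers per host) rather than growing without limit.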
[23:32:49] but I'm already going to have to go up to the C-levels and be like "our pageviews are only 2/3rds of what we thought, sorry"
[23:33:03] that's /before/ it turns out N% of the "actual" pageviews are also horse-poop
[23:36:50] Was the old definition suffering from the dupes too?
[23:45:51] yeah, although weirdly it was the UDFs worst-hit by it
[23:46:07] oh, I just realised I didn't write that into the email. Whoops.
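The "arithmetic series calculation" Ironholds mentions at [23:07:38] presumably amounts to this: per host, a clean stream of sequence numbers contains exactly max - min + 1 distinct values, so comparing the expected, actual, and distinct counts separates duplicates from genuine loss, which also speaks to the worry at [23:30:44] about what the real pageview count is. A sketch with assumed field names.

```python
from collections import defaultdict

def sequence_stats(records):
    """Per-host delivery accounting from sequence numbers.

    expected   -- max(seq) - min(seq) + 1, what a gapless,
                  duplicate-free stream would contain
    duplicates -- rows beyond the first delivery of each sequence number
    missing    -- sequence numbers never seen at all (genuine loss)
    Field names are assumptions.
    """
    by_host = defaultdict(list)
    for record in records:
        by_host[record["hostname"]].append(record["sequence"])
    for host, seqs in sorted(by_host.items()):
        distinct = len(set(seqs))
        expected = max(seqs) - min(seqs) + 1
        yield host, expected, len(seqs) - distinct, expected - distinct
```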