[00:25:19] (PS2) Milimetric: Fix Visualization of multiple lines [analytics/dashiki] - https://gerrit.wikimedia.org/r/158004
[02:01:19] (PS6) Nuria: [WIP] Adding removing options from browsing component [analytics/dashiki] - https://gerrit.wikimedia.org/r/157777
[08:50:58] Analytics / General/Unknown: Kafka broker analytics1021 not receiving messages every now and then - https://bugzilla.wikimedia.org/69667#c4 (christian) It happened again on: 2014-08-24 ~16:00 (with recovery for the next reading in ganglia. Since ganglia shows a decrease of volume for that time...
[08:56:28] Analytics / General/Unknown: Kafka broker analytics1021 not receiving messages every now and then - https://bugzilla.wikimedia.org/69667#c5 (christian) (In reply to christian from comment #4) > 2014-08-24 ~16:00 (with recovery for the next reading in Wrong day. It should read 2014-08-25 ~16:00 (...
[09:00:58] Analytics / Refinery: Hive is broken on stat1002 - https://bugzilla.wikimedia.org/70203#c6 (christian) NEW>RESO/FIX Works for me again (Hence closing). Thanks!
[09:07:29] Analytics / Refinery: Raw webrequest partitions for 2014-08-30T03:xx:xx not marked successful - https://bugzilla.wikimedia.org/70330 (christian) NEW p:Unprio s:normal a:None For the hour 2014-08-30T03:xx:xx, none [1] of the four sources' buckets was marked successful. What happened? (I...
[09:07:42] Analytics / Refinery: Raw webrequest partitions that were not marked successful - https://bugzilla.wikimedia.org/70085 (christian)
[09:07:58] Analytics / Refinery: Hive is broken on stat1002 - https://bugzilla.wikimedia.org/70203 (christian)
[09:07:58] Analytics / Refinery: Raw webrequest partitions for 2014-08-30T03:xx:xx not marked successful - https://bugzilla.wikimedia.org/70330 (christian)
[09:12:30] Analytics / Refinery: Single raw webrequest partition for 2014-09-30T06:xx:xx not marked successful - https://bugzilla.wikimedia.org/70331 (christian) NEW p:Unprio s:normal a:None The upload partition for 2014-09-30T06:xx:xx was not marked successful [1]. What happened? [1] ________...
[09:12:30] Analytics / Refinery: Raw webrequest partitions that were not marked successful - https://bugzilla.wikimedia.org/70085 (christian)
[09:16:28] Analytics / Refinery: Hive is broken on stat1002 - https://bugzilla.wikimedia.org/70203#c7 (christian) Just to keep bugs connected: (In reply to christian from comment #4) > The monitoring however runs from within the cluster. So the monitoring > is working: > > +---------------------+--------+----...
[09:23:58] Analytics / Refinery: Make webrequest partition validation handle races between time and sequence numbers - https://bugzilla.wikimedia.org/69615#c6 (christian) Happened again for: 2014-08-31T15:xx:xx/2014-08-31T16:xx:xx (on upload) 2014-09-02T17:xx:xx/2014-09-02T18:xx:xx (on upload)
[10:57:13] (PS3) Milimetric: Fix Visualization of multiple lines [analytics/dashiki] - https://gerrit.wikimedia.org/r/158004
[13:45:42] Analytics / General/Unknown: Kafka broker analytics1021 not receiving messages every now and then - https://bugzilla.wikimedia.org/69667#c6 (christian) (In reply to Andrew Otto from comment #3) > Today I increased kafka.queue.buffering.max.ms from 1 second to 5 seconds. > Leader election can take up t...
[14:59:02] (PS4) Milimetric: Fix Visualization of multiple lines [analytics/dashiki] - https://gerrit.wikimedia.org/r/158004
[14:59:45] (PS1) Milimetric: Layout graph so labels are visible [analytics/dashiki] - https://gerrit.wikimedia.org/r/158106
[16:50:17] (PS3) Milimetric: Add dashiki config generation [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/157761
[16:50:42] nuria__: ^ that last patchset is good to go
[16:51:34] ok, I have not done anything dashboard yet today. have an interview in 10 min, will enter feedback after and start "actual work" after that
[16:51:41] ^ milimetric
[16:53:59] (CR) Milimetric: "tested in staging, files are at:" [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/157761 (owner: Milimetric)
[16:54:20] k nuria__ , no rush. I added a comment with links to the files the script generated in staging
[16:54:44] i'll submit the corresponding dashiki patch now
[18:28:13] qchris: fyi, i'm going to see if I can do some troubleshooting of the kafka log loss during leader election, i think I can force it by doing controlled shutdowns
[18:28:31] Great!
[18:54:01] drats, qchris, i think i can't replicate this
[18:54:11] i *can* replicate produce errors on brokers when leadership changes
[18:54:13] but that is expected
[18:54:19] Mhmm.
[18:54:21] but, i can't seem to make producers actually error out...they just retry
[18:54:27] as far as I can tell
[18:54:45] also, apparently varnishkafka.log hasn't been written to since june
[18:54:52] :-)
[18:54:53] ori created an rsyslog module and made a typo
[18:55:03] which deleted the varnishkafka package's rsyslog conf
[18:55:15] i fixed that now, but that means I can't correlate anything historical
[18:55:25] so...we'll have to wait for another loss + leader election event
[18:55:34] No worries, I guess it'll soon happen again. Right.
[18:56:02] Since it mostly (always?) seems to start with analytics1021 timing out the connection to the namenode ...
[18:56:17] would it work to drop all packets on that connection?
[18:56:22] (For some time)
[18:56:48] yeah, mostly, but not always
[18:56:56] Ok.
[18:57:14] timing out connection to zk*
[18:57:29] hmmm, like blocking the zk connection
[18:57:30] maybe, ja
[18:57:36] Sorry. zk. Right.
[18:57:52] but, that is similar to what I just did with the broker shutdown (except it lasted a lot longer and I don't want to do it again)
[18:58:03] from the producer's viewpoint, the leader changes
[18:58:13] but, there is a short amount of time when it has the old leader data
[18:58:16] so, it keeps sending to it
[18:58:37] I guess it's fine to wait for the next occurrence of the issue.
[18:58:40] i just did a broker shutdown, and I saw the failed produce requests on the broker due to the leadership change
[18:58:49] but, i can't find any varnishkafkas that actually dropped messages
[18:59:50] Mhmm.
[18:59:59] btw, in your bug report, you said a leader election happened around 09-02 ~10:50
[19:00:09] is it possible you meant to say around ~10:33?
[19:01:03] The ~ in those timings is very flexible. Yes.
[19:01:09] interesting, i'm looking at the log for that change, and see 4 seconds pass between the zk timeout and the last produce request failure
[19:01:19] i think my vk buffer is 5 seconds
[19:01:24] maybe it needs to be a little bit larger?
[19:02:18] we've got queue_buffering_max_messages set at 500000
[19:02:32] the most active varnish serves max around 6K / second
[19:02:53] shouldn't hurt to up the queue_buffering_max_ms timeout to 10 seconds instead of 5...
[19:03:19] Agreed.
[19:04:53] (I again checked timings. They were from ganglia (which is expected to be off a bit). There 10:12 is still good. The next reading is from 10:54 and already bad.)
[19:07:24] ok, i'm going to up this to 10 seconds, and cross my fingers that we will never see election cause loss again!
[19:07:57] :-D
[19:08:59] qchris, this also means i'll be restarting vks, so seqs will reset to 0
[19:09:12] Yes. No worries.
[19:09:54] we should add some extra checking to the _SUCCESS generator
[19:10:09] Against sequence number resetting?
[19:10:10] if min seq == 0, then just say it's ok
[19:10:14] yeah
[19:10:55] I hope we do that really, really seldom. So just "hdfs dfs -touchz ..../_SUCCESS" should be simpler.
[19:11:11] ha, ok.
[19:19:34] ottomata: is the "queue_buffering_max_messages => 500000" sufficient in
[19:19:35] https://git.wikimedia.org/blob/operations%2Fpuppet.git/5b43d1e62224f2a37aadd0aa957feeaefc560c5a/manifests%2Frole%2Fcache.pp#L510
[19:20:20] The comment above it seems to imply that we only have one cache.
[19:20:28] But we're having ~100 instead.
[19:21:20] I mean ... sure ... not all of the caches are running at 6000 msgs/second, so it might work out.
[19:22:11] no, certainly, that's just the max that bits hit (when I last checked which was some months ago)
[19:22:56] but is 6000 the "sum of all bits caches" or "just one bits cache"?
[19:23:25] I'd assume "just one bits cache".
[19:23:48] just one
[19:23:54] and not usually that much
[19:24:02] usually bits each max hover around 4K
[19:24:58] So to handle bursts, we'd need: 10 (# of seconds) * 100 (# of caches) * 6000 (max msgs /second) = 6000000 msgs for
[19:25:06] queue_buffering_max_messages?
[19:28:41] ? no this is a per varnishkafka setting, that says how much vk is allowed to buffer
[19:29:13] * qchris facepalms.
[19:29:15] Right :-)
[19:29:39] the only worry about setting it too high is memory usage
[19:29:47] on the cache server
[20:35:47] Hey ottomata, I have some broken python things since the upgrade to stat3. To RT?
[20:37:46] hm, what's up?
[20:38:29] I need some headers for compiling python stuff (e.g. bzip2 package in the standard library doesn't work)
[20:38:35] Also, I need virtualenv.
[20:39:29] "apt-get install libbz2-dev" would get me the bz2 headers
[20:40:17] libbz2 should be fine i think
[20:40:54] -dev
[20:40:58] virtualenv...
[20:42:14] something wrong with virtualenv?
[20:42:46] i dunno, i have a feeling there was an objection to it at some point
[20:42:50] but maybe I am remembering wrong
[20:43:06] i just created an RT ticket, just so I'd have a paper trail for that one
[20:43:37] OK. If we can't have virtualenv, that will block my work.
[20:44:42] Also, I've been working with virtualenv on WMF machines for years. It's frustrating that every time we do an upgrade, I need to fight for it again.
[20:45:15] ok, you have fought for it before? do you remember the outcome?
[20:45:20] It was installed
[20:45:26] I have been using virtualenv.
[20:45:41] hah, i mean, was there a discussion about it? reasons why not?
[20:45:56] i only have a vague feeling that there was some discussion about it, around or before my time here
[20:46:28] Can't find anything in my email history.
[20:48:16] halfak, i'm sorta talking about this with a couple of ops folks
[20:48:21] any reason this can't be done in labs?
[20:48:38] What can't be done in labs?
[20:48:51] whatever it is you are doing with custom python things?
[20:49:01] My work on the stats machines. That's what we have them for.
[20:49:48] "custom python things" == routine data science
[20:52:50] (i'm fighting for you!)
[20:54:57] [travis-ci] wikimedia/mediawiki-extensions-EventLogging#238 (master - 0024f00 : Translation updater bot): The build was broken.
[20:54:57] [travis-ci] Change view : https://github.com/wikimedia/mediawiki-extensions-EventLogging/compare/efef66e561f9...0024f007ca36
[20:54:57] [travis-ci] Build details : http://travis-ci.org/wikimedia/mediawiki-extensions-EventLogging/builds/34330146
[20:57:03] ottomata, I can't find the old conversations. I suspect there was one at the time that we switched to stat1 from internproxy.
[20:57:24] halfak: will libbz2-dev be better than nothing for now? i think there will be more ops discussion about virtualenv...
[21:01:32] love this, halfak: "custom python things" == routine data science
[21:01:48] :)
[21:01:53] haha! what do you use virtualenv for?
[21:02:07] Most python packages I need are rarely installed by default.
[21:02:27] e.g. pandas, requests, pyyaml, scipy, scikit-learn, etc.
[21:03:54] ah, you install them in your virtualenv then? ok
[21:04:32] Yup. I can do manual user install, but virtualenv makes that all much easier.
[21:04:45] could you have a requirements.txt file with all those and run pip install? (that is what we do to install non defaults)
[21:05:30] Sure. It turns out that I'm working with new packages (including my own) all of the time, so it wouldn't be one "requirements.txt"
[21:06:20] halfak, if any of those are deb packages and we can install them, that would be better
[21:07:05] ok, i see
[21:07:17] ottomata, many of them are not and I'm not willing to wait for someone else to install debs when I need to try new versions (or sometimes switch back to an old version).
[21:08:15] I understand where it is coming from, but if I had to wait for debs to be installed, I would be blocked on a weekly basis. That would be very bad for productivity.
[21:08:27] (PS1) Milimetric: Refactor configuration and clean up code a bit [analytics/dashiki] - https://gerrit.wikimedia.org/r/158244
[21:09:16] I can always just "python setup.py install --home=~", but virtualenv makes that much easier.
[21:09:50] Keeping virtualenv away does not prevent me from building my own python packages. It just makes it a royal pain to maintain.
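Editor's sketch of the per-user isolated environment requested in the exchange above. The log is about the third-party virtualenv tool; as an assumption, this uses the standard library's `venv` module as a stand-in. The function name `make_env` and the directory name `analysis-env` are hypothetical. It needs no root access and skips the pip bootstrap, so it only shows the isolation step, not package installation.

```python
import os
import sys
import tempfile
import venv


def make_env(env_dir):
    """Create an isolated Python environment in env_dir (no root needed).

    Returns the expected path of the environment's own interpreter.
    """
    # with_pip=False skips the pip bootstrap, so creation needs no network.
    venv.create(env_dir, with_pip=False)
    # The interpreter lives in Scripts/ on Windows and bin/ elsewhere.
    on_windows = sys.platform == "win32"
    bin_dir = "Scripts" if on_windows else "bin"
    exe = "python.exe" if on_windows else "python"
    return os.path.join(env_dir, bin_dir, exe)


if __name__ == "__main__":
    # Build a throwaway environment to show the resulting layout.
    with tempfile.TemporaryDirectory() as tmp:
        python_path = make_env(os.path.join(tmp, "analysis-env"))
        print(python_path)
```

Packages installed with the environment's own interpreter stay inside that directory, which is what makes switching between package versions cheap compared with system-wide deb packages.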
[21:22:04] (PS5) Milimetric: Fix Visualization of multiple lines [analytics/dashiki] - https://gerrit.wikimedia.org/r/158004
[21:22:09] (PS2) Milimetric: Layout graph so labels are visible [analytics/dashiki] - https://gerrit.wikimedia.org/r/158106
[21:22:15] (PS2) Milimetric: Refactor configuration and clean up code a bit [analytics/dashiki] - https://gerrit.wikimedia.org/r/158244
[21:23:41] (PS3) Milimetric: Refactor configuration and clean up code a bit [analytics/dashiki] - https://gerrit.wikimedia.org/r/158244
[21:37:06] (PS7) Nuria: Adding removing projects from browsing component [analytics/dashiki] - https://gerrit.wikimedia.org/r/157777
[21:42:57] Analytics / General/Unknown: Webstatscollector counting HTTPS from ulsfo twice - https://bugzilla.wikimedia.org/70140#c2 (christian) NEW>RESO/FIX The fix is effective. The first good files are https://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-09/pagecounts-20140902-180000.gz https:/...
[22:27:15] (PS1) Milimetric: Update for September meeting [analytics/reportcard/data] - https://gerrit.wikimedia.org/r/158257
[22:27:26] (CR) Milimetric: [C: 2 V: 2] Update for September meeting [analytics/reportcard/data] - https://gerrit.wikimedia.org/r/158257 (owner: Milimetric)
[23:06:05] (CR) Nuria: [C: 2 V: 2] "Looks good, also tested in staging myself." [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/157761 (owner: Milimetric)
[23:06:18] (Merged) jenkins-bot: Add dashiki config generation [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/157761 (owner: Milimetric)
[23:52:54] (PS1) Yuvipanda: Make query revision text TEXT instead of varchar(4096) [analytics/quarry/web] - https://gerrit.wikimedia.org/r/158287
[23:54:33] (PS1) Yuvipanda: Enable bootstrap's responsive behavior [analytics/quarry/web] - https://gerrit.wikimedia.org/r/158290
[23:57:07] (CR) Yuvipanda: [C: 2] Make query revision text TEXT instead of varchar(4096) [analytics/quarry/web] - https://gerrit.wikimedia.org/r/158287 (owner: Yuvipanda)
[23:57:12] (Merged) jenkins-bot: Make query revision text TEXT instead of varchar(4096) [analytics/quarry/web] - https://gerrit.wikimedia.org/r/158287 (owner: Yuvipanda)
[23:57:17] (CR) Yuvipanda: [C: 2] Enable bootstrap's responsive behavior [analytics/quarry/web] - https://gerrit.wikimedia.org/r/158290 (owner: Yuvipanda)
[23:57:22] (Merged) jenkins-bot: Enable bootstrap's responsive behavior [analytics/quarry/web] - https://gerrit.wikimedia.org/r/158290 (owner: Yuvipanda)
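Editor's sketch of the buffer arithmetic from the 19:01 to 19:29 exchange. Since queue_buffering_max_messages is a per-varnishkafka limit, the relevant bound is one cache host's peak rate times the buffering window, not a sum over all caches. The figures (about 6000 msgs/second peak, a 10 second window, a 500000 message limit) come from the log; the function names are hypothetical, and treating the worst case as rate times window is a simplifying assumption.

```python
def messages_to_buffer(peak_msgs_per_sec, window_sec):
    """Messages one producer must hold to ride out window_sec at peak rate."""
    return peak_msgs_per_sec * window_sec


def seconds_of_headroom(queue_limit, peak_msgs_per_sec):
    """How long one producer can buffer at peak rate before hitting the limit."""
    return queue_limit / peak_msgs_per_sec


if __name__ == "__main__":
    queue_limit = 500000  # queue_buffering_max_messages, per varnishkafka
    peak_rate = 6000      # busiest single bits cache, msgs/second (from the log)
    window = 10           # proposed buffering window, in seconds

    needed = messages_to_buffer(peak_rate, window)
    print("needed per host:", needed)
    print("fits within limit:", needed <= queue_limit)
    print("headroom in seconds:", round(seconds_of_headroom(queue_limit, peak_rate), 1))
```

With these figures a single host needs room for 60000 messages, well under the 500000 limit, which leaves roughly 83 seconds of buffering at peak rate. That matches the conclusion in the log that the remaining concern is memory use on the cache server rather than the queue size.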