[07:45:42] Analytics / Tech community metrics: Wrong data at "Update time for pending reviews waiting for reviewer in days" - https://bugzilla.wikimedia.org/68436#c12 (Alvaro) The project DataValues does not appear anymore!
[10:37:28] Analytics / Wikimetrics: Story: AnalyticsEng has static file with list of projects and metrics - https://bugzilla.wikimedia.org/68822 (Dan Andreescu) NEW>RESO/FIX
[12:21:13] Analytics / Wikimetrics: Story:b WikimetricsUser runs 'Rolling New Active Editors' report - https://bugzilla.wikimedia.org/67459 (Dan Andreescu) a:Dan Andreescu
[12:53:41] Analytics / Tech community metrics: "Volume of open changesets" graph should show reviews pending every month - https://bugzilla.wikimedia.org/70278#c9 (Alvaro) (In reply to Quim Gil from comment #8) > I still find your answer slightly confusing: > > > In order to get this number, we get the total amo...
[13:20:39] qchris: hiya!
[13:20:48] ottomata: Good morning!
[13:20:52] in the bug, when you say "analytics1021 ganglia
[13:20:52] graphs that showed a exceptional in-/decrease during that period
[13:20:52] "
[13:21:00] graphs of what?
[13:21:17] graphs from ganglia for analytics1021
[13:21:28] I went through all graphs,
[13:21:29] of cpu load?
[13:21:30] oh
[13:21:41] and stored those that had a spike/drop.
[13:21:54] So if it happens again, we can see if the spikes/drops agree.
[13:22:17] Like the issue occurring twice affects the same graphs or not.
[13:22:33] also, just to confirm your last comment
[13:22:44] you have a script just writing a timestamp to a file every second?
[13:22:58] and every now and then there is a large gap?
[13:23:27] Yes, to confirm that comment. We're having actually no data/graphs to compare against. And we had no clue what it could be. So I wanted to create some content to compare against. So at least we have a start.
[13:23:37] Yes about the timestamp logging.
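The timestamp logging confirmed above (one timestamp written per second, then looking for large gaps) can be sketched roughly as follows. This is only an illustration in Python, not the script actually running on analytics1021: the function names, the one-Unix-timestamp-per-line file format, and the 5-second threshold are all assumptions.

```python
from typing import Iterable, List, Tuple


def find_gaps(timestamps: Iterable[float], max_gap: float = 5.0) -> List[Tuple[float, float]]:
    """Return (previous, current) pairs whose distance exceeds max_gap seconds.

    Assumes the timestamps are in the order they were written.
    The 5-second default is an arbitrary, hypothetical threshold.
    """
    gaps = []
    previous = None
    for current in timestamps:
        if previous is not None and current - previous > max_gap:
            gaps.append((previous, current))
        previous = current
    return gaps


def read_timestamps(path: str) -> List[float]:
    """Read one Unix timestamp per line (assumed file format), skipping blank lines."""
    with open(path) as handle:
        return [float(line) for line in handle if line.strip()]
```

With a log written once per second, `find_gaps(read_timestamps(path))` would list the stretches where the writer stalled, which can then be lined up against the ganglia spikes/drops.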
[13:23:49] (But nothing happened during the night)
[13:24:16] Run
[13:24:17] cd /home/qchris ; ./filter_logs_to_interesting_events.sh
[13:24:22] on analytics1021
[13:25:13] That exhibits matches for a given minute and filters away minutes that do not stick out too much.
[13:25:30] The first line is an artifact of me not starting the jobs at second 00.
[13:26:00] The last line is typically the current minute (which is expected to be too low)
[13:39:46] qchris, are you running those and saving outputs right now?
[13:39:52] Yes.
[13:40:04] Unless someone killed them :-)
[13:40:06] Let me check.
[13:41:00] Yup. They are running.
[13:41:03] ottomata: ^
[13:41:21] Should I stop them?
[13:43:52] no, keep them running
[13:44:02] k
[13:44:04] i'm thinking about tuning those writeback sysctl parameters like suggested
[13:44:14] Totally do.
[13:57:13] Analytics / Visualization: Story: EEVSUser adds/removes a metric/project - https://bugzilla.wikimedia.org/68142 (Dan Andreescu)
[13:57:13] Analytics / Wikimetrics: Story:b WikimetricsUser runs 'Rolling New Active Editors' report - https://bugzilla.wikimedia.org/67459 (Dan Andreescu)
[13:57:13] Analytics / Wikimetrics: Story: EEVSUser downloads report with correct Http Cache Headers - https://bugzilla.wikimedia.org/68445 (Dan Andreescu)
[13:57:29] Analytics / Wikimetrics: Story:c WikimetricsUser runs 'Rolling Surviving New Active Editors' report - https://bugzilla.wikimedia.org/67460 (Dan Andreescu) a:Dan Andreescu
[13:57:41] Analytics / Wikimetrics: Story: AnalyticsEng has static file with list of projects and metrics - https://bugzilla.wikimedia.org/68822 (Dan Andreescu)
[14:02:36] (PS1) Milimetric: Add Rolling New Active Editor [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/158620 (https://bugzilla.wikimedia.org/67459)
[14:42:42] Analytics / General/Unknown: Kafka broker analytics1021 not receiving messages every now and then - https://bugzilla.wikimedia.org/69667#c11 (Andrew Otto) Yeah, strange indeed that this only
happens on analytics1021. I *think* we have seen this elsewhere before, but not often. And, I think not since...
[14:59:56] Analytics / Tech community metrics: Wrong data at "Update time for pending reviews waiting for reviewer in days" - https://bugzilla.wikimedia.org/68436#c13 (Quim Gil) PATC>RESO/FIX This is solved now. Thank you.
[15:00:31] (PS1) Milimetric: [WIP] Add Rolling Surviving New Active Editor [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/158630 (https://bugzilla.wikimedia.org/67460)
[15:04:08] nuria
[15:04:14] you don't need to keep changing the default value
[15:04:21] that won't fix it for you anyway
[15:04:48] https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/wikimetrics.pp#L121
[15:11:58] nuria: I also don't think you should pick 24. 24 slow queries is what will bring the entire machine to a halt, and you should probably aim for lower :)
[16:08:59] ottomata: right, that setting should not be there I don't think right?
[16:10:06] Yuvipanda: If every query you do is slow it doesn't matter what your concurrency level is, the machine will get to a halt regardless, right? as in our case there
[16:10:15] are no processes that do not do queries
[16:10:35] nuria: no, if you do only 12 queries and they are all slow, there will still be 12 other slots open for other people's queries
[16:10:49] because there are 24 cores on that machine and mysql does only one core per query-thread
[16:11:58] if 12 random queries are slow, the other 12 people's random queries will also likely be slow, there is not a huge variety of queries here
[16:12:09] they are all "pre-canned"
[16:12:16] and very similar in nature
[16:13:05] it could be that the db you are trying to connect to is slow (w/ other queries running)
[16:13:07] sure, but if wikimetrics uses all 24 query slots, there are no other slots for the queries from toollabs/quarry/pthers...
[16:13:10] *others
[16:13:28] then they will be slow even if they'd have otherwise been fast
[16:13:41] YuviPanda: but labs users are given 512 concurrent connections
[16:13:51] that sounds bad if just 24 concurrent can starve the servers
[16:13:53] wait, the 24 are per labs instance, are they not?
[16:13:54] milimetric: as springle said, that's a theoretical maximum :D
[16:14:01] lol
[16:14:07] to prevent DDoS
[16:14:09] so people should be assigned like 5 or so then
[16:14:26] indeed, but again as springle said, he hopes people don't keep more than that open :D
[16:14:36] and a burst of 24 queries for 5min isn't going to cause a problem
[16:14:41] cause 24 concurrent connections for everyone in labs to share sounds much too small
[16:14:46] but if you keep 24 connections open running queries all the time...
[16:15:11] nuria: no, it's not 24 concurrent connections, but 24 cores on that machine, so once you go more than one it will degrade for other users
[16:15:12] no, that is not the case
[16:15:13] oh but wikimetrics won't keep that open all the time unless it gets more popular by a few orders of magnitude
[16:15:32] yuvipanda: we do not keep 24 running at all times
[16:15:34] in which case we'd separate the normal user processing into another celery process and give it fewer concurrent
[16:15:42] aaah
[16:15:44] we are setting the maximum
[16:15:45] the recurrent reports do use 24 at a time
[16:15:47] aaah
[16:15:55] but that's just for like 1 hour at night when nobody's hitting the system anyway
[16:15:57] for a small amount of time
[16:16:06] I might've gotten a bit too paranoid then :)
[16:16:21] sok, good that we're all thinking about the interactions of our different medicines :)
[16:16:31] ok, so labs is just 1 machine with 24 cores and a bunch of vms on top of it, right?
[16:17:09] btw though, we've had 100 concurrent and that ran fine
[16:19:46] nuria: no, this is the labsdb machine, 24 cores and only mysql
[16:19:56] ah ok
[16:21:03] yuivipanda: it's good that you were keeping an eye on the settings though
[16:21:15] *Yuvipanda
[16:21:28] :)
[16:23:44] ottomata: are you back?
[16:24:27] yup
[16:24:29] hiya
[16:25:48] ottomata: could we specify the concurrency in the template like <%= @debug ? '24' : '16' %>
[16:26:11] and remove the setting from https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/wikimetrics.pp#L121
[16:26:18] or is that not optimal?
[16:26:40] i *think* you can do that
[16:26:49] but
[16:26:51] why would you?
[16:27:22] really, the value you should put in the module should have nothing to do with where you are deploying it. it should just be a sane default
[16:27:33] e.g., what would be good for wikimetrics in vagrant?
[16:27:35] or something like that
[16:29:09] ottomata: but all our values have two settings, debug and not, right?
[16:31:38] ottomata: so you think it is best to have values specified in wikimetrics.pp and init.pp (if so that is fine)
[16:31:58] (ah, sorry, in 1:1 w t)
[17:11:12] hm, well, i mean, i do think it is weird that 'debug' actually changes runtime configs
[17:11:26] 'debug' to me means debug logging, or extra information in order to debug problems
[17:11:39] i suppose we've just overloaded the meaning of that variable beyond 'debug'
[17:11:49] now it means more like 'staging-test-settings'
[17:11:52] or
[17:11:55] 'non-production'
[17:12:26] (nuria ^ >)
[17:12:40] i think I mind its overloaded meaning less in the role class
[17:12:40] holaa ottomata
[17:12:46] since that is the usage of the module
[17:12:52] but the module itself should be a little purer i think
[17:12:58] and not conflate the meaning if possible
[17:14:02] ok, so what changes should we do, do we change the '100' in the module?
[17:14:18] i think you should leave 10 as the default, and just set what you want based on the environment you are running in in the role
[17:14:28] the user of the class should override values, the default module values shouldn't care
[17:14:46] decouple your environment configurations from your default module usage
[17:14:57] but the environment you are running is determined just by @debug
[17:15:06] there is no other indication
[17:15:45] i know, i don't like it :P
[17:15:57] so you mean @debug ?'blah':'other-blah'
[17:16:34] this is good:
[17:16:34] DEBUG : <%= @debug ? 'True' : 'False' %>
[17:16:34] LOG_LEVEL : <%= @debug ? 'DEBUG' : 'INFO' %>
[17:16:38] makes sense
[17:16:42] this does not make sense
[17:16:43] CELERY_TASK_RESULT_EXPIRES : <%= @debug ? 3600 : 2592000 %>
[17:17:02] but
[17:17:02] i think
[17:17:05] if i were to do it over:
[17:17:29] you could conditionally select these config values in the ROLE class based on a $wikimetrics_environment variable (not $debug)
[17:17:45] and the module would just have sane (hardcoded?) defaults for the actual config values
[17:18:02] if you need to set environment specific configs, then you override them in the instance usage of the class, not the base default values
[17:18:10] if you were writing an actual OO class somewhere
[17:18:28] you wouldn't make the class configure itself based on environment specific vars
[17:18:34] e.g., you wouldn't ever do something like:
[17:19:31] if (!debug) { ServerName = 'metrics.wmflabs.org' } else { ServerName = 'metrics-staging.wmflabs.org' }
[17:19:32] so
[17:19:45] (in the module)
[17:19:53] so why would you set other environment specific settings in the module
[17:20:23] i understand but you also understand that i am not going to change the wikimetrics module just to change the concurrency level, right?
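The OO analogy above (the module keeps sane defaults, the role overrides them per environment) could look something like this in Python. It is a sketch of the principle only, not Puppet and not the real wikimetrics module: the class, the `ROLE_OVERRIDES` table, the `config_for` helper, and the staging values are hypothetical. Only the default of 10 and the production value of 24 come from the discussion.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class WikimetricsModuleConfig:
    """Stands in for the module: plain defaults, no knowledge of environments."""
    celery_concurrency: int = 10   # the default suggested in the chat
    log_level: str = "INFO"        # hypothetical default


# Stands in for the role class: the only place that knows about environments.
ROLE_OVERRIDES = {
    "production": {"celery_concurrency": 24},                    # value from the chat
    "staging": {"celery_concurrency": 4, "log_level": "DEBUG"},  # hypothetical values
}


def config_for(environment: str) -> WikimetricsModuleConfig:
    """Start from the module defaults and apply the role's overrides, if any."""
    overrides = ROLE_OVERRIDES.get(environment, {})
    return replace(WikimetricsModuleConfig(), **overrides)
```

An environment the role knows nothing about (say, a vagrant box) simply gets the module defaults, which is the point of keeping them environment-agnostic.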
[17:20:35] riiiight, but that is what you are trying to do right now
[17:20:49] https://gerrit.wikimedia.org/r/#/c/158629/2/manifests/init.pp
[17:20:55] you are changing the wikimetrics module
[17:21:12] when you already have the ability to override the default in the role class
[17:21:13] https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/wikimetrics.pp#L121
[17:21:41] i can get away with swapping the '100' to '24' actually on manifests/role/wikimetrics.pp
[17:21:44] yes
[17:21:55] ok, will send that change along
[17:22:09] this is the actual usage of that variable
[17:22:10] https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/wikimetrics.pp#L160
[17:22:15] it overrides the default in the module
[17:22:37] you'd have to change that anyway to get your setting the way you want
[17:22:59] so i'm just saying: why change the module at all? :)
[17:22:59] (PS2) Milimetric: Add Rolling Surviving New Active Editor [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/158630 (https://bugzilla.wikimedia.org/67460)
[17:23:46] ottomata: understood
[17:23:52] cool :)
[17:30:02] nuria: both metrics done!
[17:30:03] :D
[17:33:36] milimetric: i am trying to test a bit the heap and caching of promises
[17:33:53] need to test some more and will let you know
[17:35:45] nuria: do you mean the caching here: https://gerrit.wikimedia.org/r/#/c/158004/ ?
[17:35:54] because i haven't touched the other review, was gonna look at that now
[17:36:30] ack, gotta run to post office, be back asap
[17:37:51] milimetric: yes
[17:38:16] It's keeping the ones that are attached to graphs on the page. Without them, I'd have to write more complicated code to keep the results around, and I couldn't even figure out how to do it
[17:38:24] *lines on the graph on the page
[17:40:27] qchris: do you guys have the ability to quickly scan for suspicious changes in the pattern of requests made to bits since ~16:20UTC?
[17:40:53] ori: What kind of changes should we look for?
[17:41:14] dunno yet, maybe a spike in some user agent or a url pattern
[17:41:49] Ok. I'll try poke around a bit.
[17:45:47] qchris: if you're trying to have a normal Friday I can have a look
[17:46:02] milimetric: Look too :-)
[17:46:12] milimetric: 4 eyes see more than 2.
[17:47:06] k, let's see what this cluster can do
[17:58:34] qchris: we don't have data in hadoop since 16:00, right?
[17:58:47] It is in hadoop. But not yet in hive.
[17:58:51] right
[17:59:07] You need to create a table and add the partition there to point at that data.
[18:02:24] milimetric: You can use "qchris.webrequest" (that has only the relevant partition)
[18:03:30] cool :)
[18:08:37] oops - i gotta run out - i'm running some queries, will do more when i get back
[18:18:03] ori: Maybe I am missing the elephant in the room, but I cannot find anything that sticks out.
[18:19:15] I mean ...
[18:19:22] 304s increased a bit.
[18:19:26] So did 503s
[18:22:10] (User Agents for the 503s are all over the place, also after browser identification)
[18:24:06] 503s are basically to bits.wikimedia.org/$WIKI/load.php?...
[18:25:50] Meh. That's not useful.
[18:29:53] ori: quite some 503s have "skin=vector?" (so vector with question mark appended)
[18:51:15] qchris: hmm. i don't think that's it. :/ there was a spike in varnish sessions and now it went away
[18:51:25] thanks for looking
[18:51:42] Sorry for not finding anything :-(
[18:52:33] (The 503 vanished again after 40 minutes)
[19:02:59] ori: does an addition of a new cookie cause a spike in varnish sessions?
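The kind of scan described above (comparing what the requests looked like before and during the suspicious period, e.g. by HTTP status) can be sketched as a simple count comparison. This is an illustrative Python sketch, not the Hive queries that were actually run: the `status_shift` name, the plain lists of status codes as input, and the factor-of-2 threshold are assumptions, and it presumes the two windows cover equal lengths of time.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def status_shift(baseline: Iterable[str], incident: Iterable[str],
                 min_ratio: float = 2.0) -> Dict[str, Tuple[int, int]]:
    """Flag status codes whose count grew by at least min_ratio in the incident window.

    Returns {status: (count_before, count_during)}. A status that was absent
    from the baseline is compared against 1 to avoid dividing by zero.
    """
    before_counts = Counter(baseline)
    during_counts = Counter(incident)
    flagged = {}
    for status, during in during_counts.items():
        before = before_counts.get(status, 0)
        if during >= min_ratio * max(before, 1):
            flagged[status] = (before, during)
    return flagged
```

The same comparison works for any other field one might suspect, such as user agent or a URL pattern, by feeding in those values instead of status codes.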
[19:38:39] (PS8) Nuria: Fix Visualization of multiple lines [analytics/dashiki] - https://gerrit.wikimedia.org/r/158004 (owner: Milimetric)
[19:39:00] milmetric: fixed 1 small bug i think this is ready to merge
[19:40:36] ^milimetric
[19:40:57] trying again: fixed 1 small bug, i think it is ready to merge
[19:44:57] (CR) Milimetric: [C: 2 V: 2] Fix Visualization of multiple lines [analytics/dashiki] - https://gerrit.wikimedia.org/r/158004 (owner: Milimetric)
[19:45:02] k nuria, thx
[19:45:13] i have to get whatever js formatter you're using i think
[19:45:17] otherwise we'll have format wars
[19:45:18] :)
[19:45:35] but wait, what got reformatted? let me see
[19:47:19] ah i see, we have different formatters for json
[19:48:40] ok, let's just submit this one and pick a ciommon formatter
[19:48:46] *pick a common
[19:51:05] ottomata: you tricked me
[19:51:19] it's not a tsv, it just has 4 spaces
[19:51:37] ah!
[19:51:47] milimetric: what if you had a csv with 3 columns?
[19:51:52] rs, avgqu-sz and await?
[19:51:59] that's fine, because it looks like vega lets you pick the columns
[19:52:41] like
[19:52:42] http://noc.wikimedia.org/~otto/elastic/1015/rs_avgqu-sz_await.csv
[19:52:51] ah no headers, one sec..
[19:53:43] ok
[19:53:45] now check it
[19:53:59] milimetric: i think don't worry about it, i just wanted to do it fast in vega if possible..i loaded into a spreadsheet :/
[19:54:18] ottomata: yeah, vega's not fast at all
[19:54:22] it's just complete
[19:58:26] (CR) Nuria: Refactor configuration and clean up code a bit (2 comments) [analytics/dashiki] - https://gerrit.wikimedia.org/r/158244 (owner: Milimetric)
[20:00:04] (PS4) Milimetric: Layout graph so labels are visible [analytics/dashiki] - https://gerrit.wikimedia.org/r/158106
[20:03:41] (CR) Nuria: [C: 2 V: 2] Layout graph so labels are visible [analytics/dashiki] - https://gerrit.wikimedia.org/r/158106 (owner: Milimetric)
[20:29:42] (PS4) Milimetric: Refactor configuration and clean up code a bit [analytics/dashiki] - https://gerrit.wikimedia.org/r/158244
[20:54:12] Analytics / Refinery: Make webrequest partition validation handle races between time and sequence numbers - https://bugzilla.wikimedia.org/69615#c7 (christian) Happened again for: 2014-09-04T22:xx:xx/2014-09-04T23:xx:xx (on upload)
[21:01:25] (CR) Nuria: Refactor configuration and clean up code a bit (2 comments) [analytics/dashiki] - https://gerrit.wikimedia.org/r/158244 (owner: Milimetric)
[21:13:00] Analytics / General/Unknown: X-Analytics header is "php=zend;php=zend" instead of "php=zend" on bits for some requests - https://bugzilla.wikimedia.org/70463 (christian) NEW p:Unprio s:normal a:None For some requests of the bits caches from esams and ulso, X-Analytics headers are php=zen...
[21:44:53] have a good weekend everyone
[21:48:15] laters!
[21:48:34] Enjoy your weekends :-)