[00:00:08] yeah, we had a process that ran away from the user last week, had to kill it and take some extreme measures as well, let me know if you need a second pair of eyes
[00:02:12] HaeB: !!! That’s very sudden news. I wish you all the best and I thank you very much for how thorough you are and the many many things you’ve taught me about the movement. Hope to still see you at some wmf events, Tilman
[00:02:40] and my gmail is no secret, dan.andreescu
[00:09:16] HaeB: WOW! super thanks
[00:09:22] robh: here!
[00:09:39] no worries
[07:04:39] Analytics: Check home/HDFS leftovers of tbayer - https://phabricator.wikimedia.org/T220575 (elukey) p:Triage→Normal
[07:13:09] HaeB: best of luck! o/ If you are around and want to chat on IRC, you'll find us in the usual places :)
[07:39:54] Morning team
[07:47:35] bonjour!
[07:47:57] elukey: just read the etherpad with cumin - Very well spiced :)
[07:59:41] thanks!
[07:59:46] Analytics, Patch-For-Review, User-Elukey: Upgrade analytics cluster to Cloudera CDH 5.16.1 - https://phabricator.wikimedia.org/T218343 (elukey)
[07:59:50] it still needs to be completed and tested
[08:00:01] but in theory we could run everything from cumin1001
[08:00:48] do you think that we need solr kite sqoop flume-ng for the workers?
[08:03:07] hm - I don't think flume is needed, but sqoop is
[08:03:14] About solr and kite, I wonder
[08:04:39] I thought that sqoop was only needed on the client launching the jobs
[08:05:13] hm - I don't know elukey - Intuitively I'd say you're right
[08:06:44] anyway, not a big deal, it was only to clean up unused things :)
[08:06:49] we can keep them for the moment
[08:06:52] makes sense
[08:07:04] I think we currently don't use kite nor solr either
[08:07:33] today I am going to update the apt repo with cdh 5.16.1's packages
[08:07:41] wow
[08:07:43] ok!
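For the package cleanup discussed above ([08:00:48]-[08:07:04]), a minimal sketch of how one could check which of the optional CDH packages are actually installed on a worker; the package names come from the conversation, while the use of dpkg-query (and running this per host, e.g. via cumin) is only an assumption for illustration:

    import subprocess

    # Optional CDH packages mentioned above; sqoop may still be needed on client hosts.
    CANDIDATES = ["solr", "kite", "sqoop", "flume-ng"]

    def installed(pkg):
        """Return True if dpkg reports the package as installed."""
        result = subprocess.run(
            ["dpkg-query", "-W", "-f=${Status}", pkg],
            capture_output=True, text=True)
        return result.returncode == 0 and "install ok installed" in result.stdout

    for pkg in CANDIDATES:
        print(f"{pkg}: {'installed' if installed(pkg) else 'not installed'}")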
[08:07:46] then I'll deploy them to the testing cluster and we'll see what breaks :D
[08:48:35] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:48:52] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:48:52] elukey: Hi this is me --^
[08:48:55] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:49:03] elukey: interesting
[08:49:21] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:49:29] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:49:31] or is it me?
[08:49:33] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:50:15] here I am
[08:50:22] elukey: looks like the problem is the same as the one we were having the other
[08:50:25] day
[08:50:32] deleting old segments?
[08:50:38] I have disabled (not even deleted in deep storage) a datasource
[08:51:29] Now there is something bizarre here
[08:51:41] no huge load on machines, nothing special
[08:52:29] same issue elukey: timeouts on brokers
[08:52:50] yeah
[08:53:23] seems to be a jetty issue though
[08:53:28] possibly
[08:53:54] this is not cool
[08:54:37] also I need to roll restart them
[08:54:40] they seem to be stuck
[08:54:46] Yes - sorry about that elukey :(
[08:57:01] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:57:07] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:57:13] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:57:29] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:57:42] joal: I am wondering why we are hitting this now.. Does it mean that, with the failing timer that you found, we have never deleted data on the last deployed version? Seems strange..
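For context on the disable at [08:50:38] (and the reload later in the day): a rough sketch of the coordinator API calls involved, assuming the Druid 0.12.x endpoints; the host and datasource names are placeholders, so anything like this should be double-checked against the docs for the running version:

    import requests

    # Placeholder coordinator URL and datasource name.
    COORDINATOR = "http://druid1004.eqiad.wmnet:8081"
    DATASOURCE = "example_old_datasource"

    # Disable the datasource: segments are marked unused but are NOT removed from
    # deep storage, so it can be re-enabled later without re-indexing.
    requests.delete(f"{COORDINATOR}/druid/coordinator/v1/datasources/{DATASOURCE}").raise_for_status()

    # Re-enable it again (e.g. to reproduce the broker lock-up while under traffic).
    requests.post(f"{COORDINATOR}/druid/coordinator/v1/datasources/{DATASOURCE}").raise_for_status()

Permanent removal from deep storage is a separate step (a kill task), which is why the datasource could be reloaded and disabled again so cheaply later in the conversation.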
[08:57:45] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:57:49] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:57:59] (sorry maybe I am not understanding the issue)
[08:58:07] elukey: we only deleted data manually previously, yes
[08:58:28] elukey: maybe related? https://github.com/apache/incubator-druid/issues/325
[08:58:49] Analytics, EventBus, Operations, vm-requests, and 3 others: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (akosiaris) >>! In T219556#5098760, @Ottomata wrote: > @akosiaris, I'd love to do this sooner rather than later. It'd make some configu...
[09:00:00] joal: manually meaning via console?
[09:00:05] elukey: yes
[09:00:07] as opposed to via API now
[09:00:08] ahhh okok
[09:00:21] elukey: ah no no - Manually via API calls (only way)
[09:01:19] joal: ok so what is the difference with what you did today?
[09:01:26] none :(
[09:01:34] * elukey cries in a corner
[09:01:37] elukey: only diff seems that brokers are under load
[09:01:45] not big, but regular load
[09:02:03] I tried to grab a jstack dump but it failed
[09:04:58] elukey: the thread I pasted is messy, but I think the idea given of [num-brokers] * [num-brokers-compute-threads] MUST BE SMALLER THAN [historical-http-threads] makes a lot of sense
[09:07:06] Apart from that, I have no idea :(
[09:08:52] I found https://github.com/eclipse/jetty.project/issues/235 that seems interesting
[09:09:22] eventually leading to https://bugs.eclipse.org/bugs/show_bug.cgi?id=478923
[09:09:38] we have jetty 9.3.something afaics, maybe it could be useful
[09:10:21] reading your issue/link
[09:13:35] joal: from the task I read [num of brokers] * broker:druid.processing.numThreads > historical:druid.server.http.numThreads
[09:13:53] not smaller no?
[09:14:00] I wonder
[09:21:01] so in theory I can see how the above could lead to "locking", namely brokers waiting on historicals
[09:21:24] also say that historicals are busy for a bit deleting segments
[09:21:45] it makes some sense.. but it should only be temporary
[09:21:48] and self recover
[09:21:56] in this case, brokers lock up
[09:28:22] on druid1006 we have 296 threads for the broker and 291 for the historical
[09:30:36] GC timings wise, it is interesting to see that with some load the historicals have a big constant new gen collection rate https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&panelId=8&fullscreen&orgId=1&from=now-30d&to=now
[09:30:41] that goes up to a second
[09:33:39] joal: very interesting - I can see the same stacktrace on druid1006's broker AFTER the restart
[09:33:58] but it is serving traffic now
[09:43:02] druid.broker.http.readTimeout: PT5M looks suspicious
[09:43:33] "The timeout for data reads from Historical servers and real-time tasks."
[09:47:18] but yeah the main issue seems to be that connections are getting stuck
[09:47:28] mmmmm
[09:55:48] druid.server.http.queueSize on historicals is also by default unbounded
[09:56:48] "Size of the worker queue used by Jetty server to temporarily store incoming client connections. If this value is set and a request is rejected by jetty because queue is full then client would observe request failure with TCP connection being closed immediately with a completely empty response from server."
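A rough illustration of the sizing rule debated at [09:04:58] and [09:13:35], with hypothetical numbers standing in for the real settings: if the brokers can collectively hold more concurrent sub-queries open than each historical has HTTP threads, a temporarily slow historical (e.g. while dropping segments) lets requests pile up in its unbounded Jetty queue while the brokers wait out the full read timeout:

    # Hypothetical values; substitute the real runtime.properties settings.
    num_brokers = 3
    broker_processing_threads = 10   # druid.processing.numThreads on each broker
    historical_http_threads = 20     # druid.server.http.numThreads on each historical
    broker_read_timeout_s = 5 * 60   # druid.broker.http.readTimeout (PT5M)

    concurrent_subqueries = num_brokers * broker_processing_threads
    if concurrent_subqueries > historical_http_threads:
        # The excess sits in the historical's Jetty queue (unbounded by default),
        # while each broker keeps its connection open for up to the read timeout.
        print(f"{concurrent_subqueries} potential concurrent sub-queries vs "
              f"{historical_http_threads} historical HTTP threads: the backlog can "
              f"persist for up to {broker_read_timeout_s}s per stuck query.")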
[09:58:17] so the theory that I have in mind is now the following:
[09:58:31] 1) good level of traffic flowing
[09:58:52] 2) delete action issued to historicals, that slow down a bit due to deleting segments
[09:59:16] 3) brokers keep sending queries, and waiting 5M before timing out
[09:59:57] 4) historicals keep accumulating queries in their queue
[10:03:33] hmmm the git messages are not appearing in IRC for some reason
[10:04:40] fdans: you mean gerrit notifications?
[10:04:52] yep sorry
[10:04:57] * fdans doesn't have the best words
[10:05:30] wikibugs is on our channel though
[10:06:02] what repo are you referring to?
[10:06:45] joal: I think that the first step anyway is to deploy your change to limit the queries time windows
[10:07:09] so we can see how the GC behaves after it
[10:07:43] this is probably the first time that we see druid under "real" load
[10:08:12] I am also proposing to have some meaningful value for druid.server.http.queueSize
[10:08:16] on the historicals
[10:08:25] unbounded is a recipe for failure in my opinion
[10:11:35] elukey: not a big deal at all, but wikistats 2 doesn't seem to be triggering notifications
[10:12:00] fdans: was it generating them in the past or is it something that happened today?
[10:12:07] (trying to understand)
[10:12:15] today is the first time I'm not seeing them
[10:12:51] just saying it because I'm used to getting a notification whenever I push in wikistats2, but I'm not bothered by it personally
[10:24:50] fdans: sure sure, good to know since probably all the gerrit notifications are now broken :(
[10:24:58] I don't recall exactly where wikibugs lives
[10:26:10] a-team: need to take a longer lunch break today since I also have a doctor appointment, but I'll be of course reachable via phone/messaging/etc..
[10:26:44] about AQS: I have some ideas about configs to tweak, but first I'd like to see Joseph's patch out
[10:26:55] (we could pack it with the nodejs update too)
[10:38:31] * elukey afk lunch + doctor
[11:18:10] elukey: please excuse me for not answering, I had to look after Naé for a while
[11:18:27] I agree with the analysis elukey, let's deploy aqs when you're back :)
[13:29:16] Analytics, Operations: cergen: exceptions trying to add alt_name - https://phabricator.wikimedia.org/T220591 (Ottomata)
[13:38:56] joal: o/
[13:39:03] sorry I forgot about kids day
[13:39:20] o/
[13:39:23] let's deploy when you are free this evening!
[13:39:26] ottomata: o
[13:39:27] o/
[13:39:28] :)
[13:39:37] elukey: kids are on holiday, so no kids today (except when they're sick, my mother is home caring for them ;)
[13:39:50] ah!!!
[13:39:59] elukey: I'm soooooooo sad about the druid thing :(
[13:40:00] then anytime you want :)
[13:40:13] Ok let's move
[13:42:00] Ah! elukey - Should be Marcel doing it (we said that yesterday during standup)
[13:43:54] joal: true, I forgot about that. I'd prefer to start now though so we can also follow up with the rest of the configs :(
[13:44:14] I suspect this issue will take a long time to be fixed completely
[13:44:28] ok elukey - moving :)
[13:48:01] (PS1) Joal: Update aqs to ccd65c1 [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/502795
[13:54:52] (CR) Elukey: [C: +1] "src ccd65c11bac2630969df45df10e62c5232e6d219 looks good!"
[analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/502795 (owner: Joal)
[13:55:16] ok let's go then :)
[13:55:34] (CR) Joal: [V: +2 C: +2] "Merging for deploy" [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/502795 (owner: Joal)
[13:57:22] joal: nodejs already upgraded as well, so we can proceed
[13:57:35] awesome
[13:58:43] !log Deploying AQS
[13:58:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:59:24] I am checking HTTP requests on 1004 with httpry
[14:01:26] joal: done with canary or still in progress?
[14:03:20] elukey: I was testing the new feature on aqs1004
[14:03:24] working :)
[14:03:38] elukey: I think the beloved bot will hate us for that but he
[14:03:49] proceeding with the rest
[14:04:01] ack!
[14:04:06] from httpry all good
[14:04:19] elukey: my tests were positive as well
[14:04:47] hey team
[14:05:18] Hi mforns - Sorry we went ahead and deployed aqs (ops order - I didn't dare oppose ;)
[14:05:43] ahaha not an order :D
[14:05:48] ;)
[14:06:13] mforns: we have discovered some stuff to do after this deploy, I wanted to start earlier :( (also my bad that I had to run an errand)
[14:12:44] elukey, no problem of course
[14:13:25] actually, I'm not feeling very well today, and my daughter is in bed with a fever...
[14:13:32] so you did me a favor :]
[14:13:48] mforns: please rest :)
[14:14:15] elukey: deploy successfully finished, and as expected, some 4XX on the edits-per-page endpoint
[14:14:32] super
[14:15:25] query count decreased a lot
[14:15:31] and so did GC timings
[14:15:51] \o/ :)
[14:16:28] I'm also super happy because we actually didn't reduce service on important endpoints (the ones for wikistats2)
[14:16:42] joal: great job :)
[14:17:12] response-time pX is also dropping - goooood :)
[14:21:47] (PS3) Fdans: Replace time range selector [analytics/wikistats2] - https://gerrit.wikimedia.org/r/499968 (https://phabricator.wikimedia.org/T219112)
[14:21:54] (CR) jerkins-bot: [V: -1] Replace time range selector [analytics/wikistats2] - https://gerrit.wikimedia.org/r/499968 (https://phabricator.wikimedia.org/T219112) (owner: Fdans)
[14:23:35] Analytics, Operations, Patch-For-Review: cergen: exceptions trying to add alt_name - https://phabricator.wikimedia.org/T220591 (Ottomata) Ah no, checked the code, and this is indeed because Puppet CA has already signed a cert for this common name.
[14:23:46] Analytics, Operations, Patch-For-Review: cergen: exceptions trying to add alt_name - https://phabricator.wikimedia.org/T220591 (Ottomata) Added a note on https://wikitech.wikimedia.org/wiki/Cergen
[14:26:22] joal: I am wondering how latencies will look once the bot is reconfigured
[14:28:08] also I am reading http://druid.io/docs/0.12.3/configuration/index.html#historical
[14:28:15] druid.server.http.numThreads - Number of threads for HTTP requests - default: max(10, (Number of cores * 17) / 16 + 2) + 30
[14:28:16] (PS2) Milimetric: Show "Loading metric details..."
on async Detail [analytics/wikistats2] - https://gerrit.wikimedia.org/r/502623 (https://phabricator.wikimedia.org/T220088)
[14:28:47] we have druid.server.http.numThreads: 20
[14:28:50] mforns or fdans: when you have some time, two simple reviews in your gerrit queue
[14:29:42] milimetric: I'm going through them right now :)
[14:29:51] so nproc is 40 (I think we have hyperthreading), so (40 * 17) / 16 + 2 is more than 40
[14:29:59] milimetric: the refactor CR should be ready to merge
[14:30:11] and then we need to add 30
[14:30:43] fdans: oh I thought I merged it, one sec
[14:30:58] yeah, merged, a couple of hours ago
[14:31:09] I'm looking at the time selector one now
[14:31:17] elukey: I think the number of threads for historical-HTTP was too low given the number of potential demands from brokers
[14:31:29] elukey: let's try that?
[14:31:59] elukey: also - I have not deleted data for the datasource I disabled - So we can easily reload it, and disable it again once loaded
[14:33:10] milimetric: ooo sorry, notifications are not working great on gerrit
[14:33:12] joal: I need to investigate the current number of threads on every druid node, because it is already a lot and I am not sure why given these numbers
[14:33:29] anyway, I'd also propose to lower the broker's read timeout (to historicals) from 5M to 30S
[14:33:38] elukey: probably because MOAR :)
[14:33:45] I turned off all my gerrit notifications and just check the homepage every day, found it more reliable and easier since I'm there anyway
[14:33:47] and also have a fixed number of requests queued on the historicals, like 500
[14:34:06] elukey: this last one doesn't yet make sense to me
[14:34:13] elukey: +1 for the timeout
[14:34:21] fail fast
[14:34:52] joal: currently the historicals have an infinite backlog, so possibly the timeout could be enough now that I think about it
[14:35:15] joal: proposal - what do you think about retrying now the same thing that you did this morning (if possible) to see if the brokers lock up?
[14:35:26] in this way we'd know if it is traffic related or not
[14:35:29] for sure
[14:35:39] elukey: I'd be interested to understand how historicals handle the http backlog
[14:36:00] elukey: ok - reloading data (might trigger an alarm because of missing segments)
[14:36:26] "Size of the worker queue used by Jetty server to temporarily store incoming client connections. If this value is set and a request is rejected by jetty because queue is full then client would observe request failure with TCP connection being closed immediately with a completely empty response from server."
[14:37:46] hm
[14:46:16] at some point I guess that TCP limits for backlog are enforced
[14:46:39] but with 5m of timeout and (potentially) infinite queueing, stacking up is pretty easy
[14:47:01] on druid1004 I can see ~294 threads for the historical
[14:48:01] !log restart turnilo to pick up the new nodejs runtime
[14:48:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:49:46] ah also
[14:49:47] druid.server.http.enableRequestLimit
[14:49:51] If enabled, no requests would be queued in jetty queue and "HTTP 429 Too Many Requests" error response would be sent.
[14:50:23] and druid.server.http.defaultQueryTimeout
[14:50:28] Query timeout in millis, beyond which unfinished queries will be cancelled
[14:50:32] that is by default 5m
[14:53:07] (CR) Fdans: [V: +2 C: +2] "Tested, looks good to me!"
[analytics/wikistats2] - https://gerrit.wikimedia.org/r/502599 (https://phabricator.wikimedia.org/T220083) (owner: Milimetric)
[14:55:18] fdans: now it seems to be working, right? --^
[14:55:21] fdans: could you rebase your time selector so I'm safe reviewing it?
[14:55:50] milimetric: I'm on it
[14:55:59] elukey: yessir!
[15:25:14] (PS4) Fdans: Replace time range selector [analytics/wikistats2] - https://gerrit.wikimedia.org/r/499968 (https://phabricator.wikimedia.org/T219112)
[15:25:21] (CR) jerkins-bot: [V: -1] Replace time range selector [analytics/wikistats2] - https://gerrit.wikimedia.org/r/499968 (https://phabricator.wikimedia.org/T219112) (owner: Fdans)
[15:26:07] ok now I am confused
[15:26:10] druid.broker.http.numConnections=20
[15:26:23] "service":"druid/broker","host":"druid1005.eqiad.wmnet:8082","version":"0.12.3","metric":"jetty/numOpenConnections","value":35}]
[15:26:35] does it mean that we are queueing?
[15:27:10] but if I do
[15:27:10] elukey@druid1005:~$ ps -eLf | grep broker | wc -l
[15:27:10] 347
[15:27:18] not all are HTTP threads of course
[15:27:30] but how is it possible that we have so many running?
[15:29:03] ah probably those are processing threads
[15:30:02] but we have druid.processing.numThreads=10
[15:30:05] * elukey confused
[15:31:37] fdans: I think there's something weird going on with https://gerrit.wikimedia.org/r/#/c/analytics/wikistats2/+/499968/, it's based on an older patchset of the graph data refactor, instead of on master
[15:31:49] milimetric: yeayea I'm doing that now
[15:32:03] oh ok sorry
[15:32:25] nono :)
[15:49:39] (PS5) Fdans: Replace time range selector [analytics/wikistats2] - https://gerrit.wikimedia.org/r/499968 (https://phabricator.wikimedia.org/T219112)
[15:51:24] elukey: segment loading finished
[15:51:30] elukey: shall I disable the datasource?
[15:52:17] (CR) jerkins-bot: [V: -1] Replace time range selector [analytics/wikistats2] - https://gerrit.wikimedia.org/r/499968 (https://phabricator.wikimedia.org/T219112) (owner: Fdans)
[15:56:14] joal: let's do it!
[15:56:38] elukey: NOW
[15:56:54] Donw
[15:58:04] down?
[15:58:04] Analytics, Analytics-Kanban, Operations, Patch-For-Review: cergen: exceptions trying to add alt_name - https://phabricator.wikimedia.org/T220591 (Ottomata)
[15:58:13] done! sorry \l9
[15:58:16] Analytics, Analytics-Kanban, Operations, Patch-For-Review: cergen: exceptions trying to add alt_name - https://phabricator.wikimedia.org/T220591 (Ottomata) Built 0.2.4 deb and installed on puppetmaster1001.
[15:58:16] :)
[15:58:23] ahahahaha
[15:58:24] okok
[15:58:34] I checked logs and nothing shows up
[15:58:36] Analytics, Analytics-Kanban, Operations, Patch-For-Review: cergen: exceptions trying to add alt_name - https://phabricator.wikimedia.org/T220591 (Ottomata)
[15:58:47] so it is traffic
[15:58:47] elukey: all segments deleted from druid1004 - and no issue occurred on broker
[15:58:59] traffic + disabling
[15:59:04] hm
[15:59:17] elukey: we can relatively easily fake traffic
[15:59:18] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:59:22] Ahhhh :)
[15:59:22] aaahhhhh!
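Coming back to the thread-count puzzle from [14:28:15]-[15:30:02]: a quick back-of-the-envelope check of the documented default for druid.server.http.numThreads against the configured value, assuming the 40 cores mentioned at [14:29:51]:

    # Documented default in Druid 0.12: max(10, (cores * 17) / 16 + 2) + 30
    cores = 40  # nproc on the druid100x hosts, assuming hyperthreading (see [14:29:51])
    default_http_threads = max(10, cores * 17 // 16 + 2) + 30
    configured_http_threads = 20  # current druid.server.http.numThreads

    # 74 vs 20: the documented default would be well above what is configured.
    # Note that the ~300 threads seen with ps -eLf also include processing, GC,
    # ZooKeeper and HTTP client threads, among others, so they are not directly
    # comparable to this setting.
    print(default_http_threads, configured_http_threads)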
[15:59:28] :)
[15:59:32] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:59:39] only one?
[15:59:49] elukey: ganeti is in codfw too, yes?
[16:00:14] I just need to create vms from ganeti2001 there?
[16:00:24] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[16:00:32] ottomata: it is, yes
[16:00:37] coo
[16:00:38] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[16:00:40] sigh
[16:00:59] restarting the brokers
[16:01:09] elukey: can you wait for a minute please
[16:01:18] you can go sorry
[16:01:25] Analytics, Analytics-Kanban, EventBus, Patch-For-Review, Services (watching): EventGate should be able to configure hasty and guaranteed kafka producers individually - https://phabricator.wikimedia.org/T219032 (Ottomata)
[16:01:29] ah sorry joal
[16:01:34] Analytics, Analytics-Kanban, EventBus, Patch-For-Review, Services (watching): EventGate should extract event time from events and produce to kafka with timestamp - https://phabricator.wikimedia.org/T219513 (Ottomata)
[16:01:36] Analytics, Analytics-Kanban, EventBus, Patch-For-Review: EventGate Helm chart should POST test event for readinessProbe - https://phabricator.wikimedia.org/T218680 (Ottomata)
[16:01:39] just done
[16:01:40] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[16:02:34] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[16:02:37] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[16:02:42] elukey: the error is super weird
[16:02:50] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[16:04:28] joal: I am wondering if it is a jetty bug
[16:04:44] elukey: good thing is, we can replicate
[16:04:51] elukey: or at least we think we can
[16:06:06] it is not that uber bot
[16:06:12] nope
[16:06:34] and I don't think it is on the historicals since restarting the brokers fixes the problem
[16:08:42] I have a feeling that the historicals might be related - brokers time out on historicals
[16:16:28] Analytics, Analytics-Kanban, Core Platform Team (Modern Event Platform (TEC2)), Core Platform Team Backlog (Watching / External), and 2 others: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (Ottomata) a:Ottomata→elukey
[16:52:17] elukey: I assume historical http threads are related to druid-broker-client http threads
[16:54:31] joal: ?
[16:55:41] elukey: broker uses an http client to talk to historical, with a default number of conns of 20
[16:56:09] what is the setting?
[16:56:43] elukey: druid.broker.http.numConnections
[16:57:29] elukey: druid.server.... is the http-server (answering users when broker, or answering broker when historical)
[16:57:54] broker is bot a server (for users) and a client (for historical)
[16:57:58] +h
[16:57:59] both
[17:00:29] ah interesting druid.server
[17:00:30] okok
[17:00:34] I missed the naming
[17:00:58] np :)
[17:01:51] my understanding though is that druid.server.http.numThreads is used on both historical and broker, but those are separate pools
[17:02:16] elukey: same setting, different files
[17:02:29] yes we use it in various runtime properties
[17:02:33] yup
[17:02:38] I am not getting your point though
[17:03:00] elukey: broker uses an http client to talk to historical, with a default number of conns of 20
[17:03:04] this started from here
[17:03:39] so this setting seems not to be picked up by the broker, if I remove it from runtime.properties I obtain the same number of threads
[17:03:42] :(
[17:03:58] so, talking about http-servers (druid.server.http.numThreads) -- broker uses this one to answer users' queries, and historical uses this one to answer brokers' queries
[17:05:25] and they also have druid.server.http.queueSize
[17:06:19] yes - this is related to those components serving requests
[17:07:04] the broker however, has a particular role, since from a request (served using the server), it generates requests to historicals (through an http client)
[17:07:37] yep
[17:07:52] And this broker-http-client is configured through druid.broker.http.numConnections
[17:08:30] yep
[17:08:47] there is also the read timeout that I wanted to reduce
[17:08:47] if brokers globally have more capacity to query than historicals have to answer, they could be overwhelmed I guess
[17:09:09] especially with long timeouts
[17:10:10] Testing this will be complicated :(
[17:10:30] elukey: testing session tomorrow morning?
[17:11:02] sure
[17:11:04] I started https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/502814/2/hieradata/role/common/druid/public/worker.yaml
[17:11:07] with some thoughts
[17:12:00] elukey: I actually think our druid architecture is wrong: sharing broker and historical seems incorrect - Let's talk tomorrow :)
[17:12:39] (i know, there is no good answer to that given we have 3 machines)
[17:12:47] Gone for dinner team, back after
[18:08:18] * elukey off!
[18:20:54] Analytics, Analytics-EventLogging, QuickSurveys: QuickSurveys EventLogging missing ~10% of interactions - https://phabricator.wikimedia.org/T220627 (Jdlrobson)
[18:33:49] Analytics, Analytics-EventLogging, QuickSurveys: QuickSurveys EventLogging missing ~10% of interactions - https://phabricator.wikimedia.org/T220627 (Isaac) > Is it possible that the link to the survey is being shared outside a QuickSurvey (e.g. social media)? @Jdlrobson Good point but I'm pretty sur...
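To summarize the client/server distinction worked out above ([16:55:41]-[17:08:47]) together with the tweaks floated earlier ([14:33:29]-[14:33:47]), a hedged sketch of where each setting applies; the concrete numbers are the ones proposed in the conversation, not tested recommendations:

    # Broker: serves user queries (server side) and fans out to historicals (client side).
    broker_settings = {
        "druid.server.http.numThreads": 20,        # Jetty threads answering user queries
        "druid.broker.http.numConnections": 20,    # HTTP client connections towards historicals
        "druid.broker.http.readTimeout": "PT30S",  # proposed: down from PT5M, fail fast
    }

    # Historical: only a server here, answering the brokers' sub-queries.
    historical_settings = {
        "druid.server.http.numThreads": 20,  # currently 20; the documented default would be ~74 on these hosts
        "druid.server.http.queueSize": 500,  # proposed: bound the Jetty queue instead of the unbounded default
    }

The actual values under review live in the hieradata worker.yaml patch linked at [17:11:04].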
[19:01:34] (PS1) Joal: Refactor refinery/python/druid.py [analytics/refinery] - https://gerrit.wikimedia.org/r/502858 (https://phabricator.wikimedia.org/T220111)
[19:15:47] (PS2) Joal: [WIP] Refactor refinery/python/druid.py [analytics/refinery] - https://gerrit.wikimedia.org/r/502858 (https://phabricator.wikimedia.org/T220111)
[19:38:34] Analytics, EventBus, Operations, vm-requests, and 4 others: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (Ottomata)
[19:52:36] Analytics, EventBus, Operations, vm-requests, and 4 others: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (Ottomata) On ganeti1003: ` sudo gnt-instance add -t drbd -I hail --net 0:link=private --hypervisor-parameters=kvm:boot_order=network -o...
[20:06:39] Analytics, Analytics-Kanban, Wikimedia-Incident: attempt to backfill eventlogging data from eventlogging-client-side topic into per schema topics - https://phabricator.wikimedia.org/T220421 (Nuria) ahem, yes, at "glacial" speed as you can see
[20:58:20] Analytics, EventBus, Services (watching): EventGate service runner worker occasionally killed, usually during higher load - https://phabricator.wikimedia.org/T220661 (Ottomata)
[20:59:08] Analytics, EventBus, Services (watching): EventGate service runner worker occasionally killed, usually during higher load - https://phabricator.wikimedia.org/T220661 (Ottomata) Here's an strace of the staging worker just before it is killed: https://gist.github.com/ottomata/ce1b3afc6213b35678b65b6151...
[21:14:09] Analytics, EventBus, Services (watching): EventGate service runner worker occasionally killed, usually during higher load - https://phabricator.wikimedia.org/T220661 (Ottomata) There is a 5 second delay between these two syscalls: ` 21:05:00.233006 sendmsg(12, {msg_name={sa_family=AF_INET, sin_port=...
[21:33:58] Analytics, EventBus, Services (watching): EventGate service runner worker occasionally killed, usually during higher load - https://phabricator.wikimedia.org/T220661 (Pchelolo) These 2 heartbeats are more than 10s apart: ` 21:04:48.628548 write(3, "{\"type\":\"heartbeat\"}\n", 21) = 21 21:04:59.3021...
[21:44:40] Analytics, EventBus, Services (watching): EventGate service runner worker occasionally killed, usually during higher load - https://phabricator.wikimedia.org/T220661 (Pchelolo) Hm, the `sendmsg` that happened before a 5-second delay was for port 9125, which is statsd_exporter.
[23:58:18] Analytics, EventBus, Beta-Cluster-reproducible, User-DannyS712: Beta cluster: MassMessage fails with PHP fatal error - https://phabricator.wikimedia.org/T220662 (Legoktm) p:Triage→Unbreak! I haven't verified myself, but if this is an issue, it's UBN.
[23:58:29] Analytics, EventBus, MediaWiki-JobQueue, WMF-JobQueue, and 2 others: Beta cluster: MassMessage fails with PHP fatal error - https://phabricator.wikimedia.org/T220662 (Legoktm)
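For the kind of analysis going on in T220661 above (spotting multi-second gaps between syscalls or heartbeats in an strace capture), a small illustrative helper; the timestamp format is assumed to match the absolute-time style visible in the pasted snippets, and the file name is hypothetical:

    import re
    from datetime import datetime

    TS_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6})\s")

    def find_gaps(lines, threshold_s=5.0):
        """Yield (gap_seconds, previous_line, current_line) for gaps above the threshold."""
        prev_ts, prev_line = None, None
        for line in lines:
            m = TS_RE.match(line)
            if not m:
                continue
            ts = datetime.strptime(m.group(1), "%H:%M:%S.%f")
            if prev_ts is not None and (ts - prev_ts).total_seconds() > threshold_s:
                yield (ts - prev_ts).total_seconds(), prev_line, line
            prev_ts, prev_line = ts, line

    # Hypothetical usage against a saved strace capture:
    # with open("eventgate-worker.strace") as f:
    #     for gap, before, after in find_gaps(f):
    #         print(f"{gap:.3f}s gap:\n  {before.rstrip()}\n  {after.rstrip()}")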