[00:00:08] yeah, we had a process that ran away from the user last week, had to kill it and take some extreme measures as well, let me know if you need a second pair of eyes
[00:02:12] HaeB: !!! That’s very sudden news. I wish you all the best and I thank you very much for how thorough you are and the many many things you’ve taught me about the movement. Hope to still see you at some wmf events, Tilman
[00:02:40] and my gmail is no secret, dan.andreescu
[00:09:16] HaeB: WOW! super thanks
[00:09:22] robh: here!
[00:09:39] no worries
[07:04:39] Analytics: Check home/HDFS leftovers of tbayer - https://phabricator.wikimedia.org/T220575 (elukey) p:Triage→Normal
[07:13:09] HaeB: best of luck! o/ If you are around and want to chat on IRC, you'll find us in the usual places :)
[07:39:54] Morning team
[07:47:35] bonjour!
[07:47:57] elukey: just read the etherpad with cumin - Very well spiced :)
[07:59:41] thanks!
[07:59:46] Analytics, Patch-For-Review, User-Elukey: Upgrade analytics cluster to Cloudera CDH 5.16.1 - https://phabricator.wikimedia.org/T218343 (elukey)
[07:59:50] it still needs to be completed and tested
[08:00:01] but in theory we could run everything from cumin1001
[08:00:48] do you think that we need solr kite sqoop flume-ng for the workers?
[08:03:07] hm - I don't think flume is needed, but sqoop is
[08:03:14] About solr and kite, I wonder
[08:04:39] I thought that sqoop was only needed on the client launching the jobs
[08:05:13] hm - I don't know elukey - Intuitively I'd say you're right
[08:06:44] anyway, not a big deal, it was only to clean up unused things :)
[08:06:49] we can keep them for the moment
[08:06:52] makes sense
[08:07:04] I think we currently don't use kite nor solr either
[08:07:33] today I am going to update the apt repo with cdh 5.16.1's packages
[08:07:41] wow
[08:07:43] ok!
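For the package cleanup discussed above ([08:00:48]-[08:07:04]), a minimal sketch of how one could check which of the optional CDH packages are actually installed on a worker; the package names come from the conversation, while the use of dpkg-query (and running this per host, e.g. via cumin) is only an assumption for illustration:

    import subprocess

    # Optional CDH packages mentioned above; sqoop may still be needed on client hosts.
    CANDIDATES = ["solr", "kite", "sqoop", "flume-ng"]

    def installed(pkg):
        """Return True if dpkg reports the package as installed."""
        result = subprocess.run(
            ["dpkg-query", "-W", "-f=${Status}", pkg],
            capture_output=True, text=True)
        return result.returncode == 0 and "install ok installed" in result.stdout

    for pkg in CANDIDATES:
        print(f"{pkg}: {'installed' if installed(pkg) else 'not installed'}")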
[08:07:46] then I'll deploy them to the testing cluster and we'll see what breaks :D
[08:48:35] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:48:52] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:48:52] elukey: Hi this is me --^
[08:48:55] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:49:03] elukey: interesting
[08:49:21] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:49:29] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:49:31] or is it me?
[08:49:33] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:50:15] here I am
[08:50:22] elukey: looks like the problem is the same as the one we were having the other
[08:50:25] day
[08:50:32] deleting old segments?
[08:50:38] I have disabled (not even deleted in deep storage) a datasource
[08:51:29] Now there is something bizarre here
[08:51:41] no huge load on machines, nothing special
[08:52:29] same issue elukey: timeouts on brokers
[08:52:50] yeah
[08:53:23] seems to be a jetty issue though
[08:53:28] possibly
[08:53:54] this is not cool
[08:54:37] also I need to roll restart them
[08:54:40] they seem to be stuck
[08:54:46] Yes - sorry about that elukey :(
[08:57:01] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:57:07] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:57:13] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:57:29] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:57:42] joal: I am wondering why we are hitting this now.. Does it mean that, with the failing timer that you found, we have never deleted data on the last deployed version? Seems strange..
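For context on the disable at [08:50:38] (and the reload later in the day): a rough sketch of the coordinator API calls involved, assuming the Druid 0.12.x endpoints; the host and datasource names are placeholders, so anything like this should be double-checked against the docs for the running version:

    import requests

    # Placeholder coordinator URL and datasource name.
    COORDINATOR = "http://druid1004.eqiad.wmnet:8081"
    DATASOURCE = "example_old_datasource"

    # Disable the datasource: segments are marked unused but are NOT removed from
    # deep storage, so it can be re-enabled later without re-indexing.
    requests.delete(f"{COORDINATOR}/druid/coordinator/v1/datasources/{DATASOURCE}").raise_for_status()

    # Re-enable it again (e.g. to reproduce the broker lock-up while under traffic).
    requests.post(f"{COORDINATOR}/druid/coordinator/v1/datasources/{DATASOURCE}").raise_for_status()

Permanent removal from deep storage is a separate step (a kill task), which is why the datasource could be reloaded and disabled again so cheaply later in the conversation.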
[08:57:45] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:57:49] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:57:59] (sorry maybe I am not understanding the issue)
[08:58:07] elukey: we only deleted data manually previously, yes
[08:58:28] elukey: maybe related? https://github.com/apache/incubator-druid/issues/325
[08:58:49] Analytics, EventBus, Operations, vm-requests, and 3 others: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (akosiaris) >>! In T219556#5098760, @Ottomata wrote: > @akosiaris, I'd love to do this sooner rather than later. It'd make some configu...
[09:00:00] joal: manually meaning via console?
[09:00:05] elukey: yes
[09:00:07] as opposed to via API now
[09:00:08] ahhh okok
[09:00:21] elukey: ah no no - Manually via API calls (only way)
[09:01:19] joal: ok so what is the difference with what you did today?
[09:01:26] none :(
[09:01:34] * elukey cries in a corner
[09:01:37] elukey: only diff seems that brokers are under load
[09:01:45] not big, but regular load
[09:02:03] I tried to grab a jstack dump but it failed
[09:04:58] elukey: the thread I pasted is messy, but I think the idea given of [num-brokers] * [num-brokers-compute-threads] MUST BE SMALLER THAN [historical-http-threads] makes a lot of sense
[09:07:06] Apart from that, I have no idea :(
[09:08:52] I found https://github.com/eclipse/jetty.project/issues/235 that seems interesting
[09:09:22] eventually leading to https://bugs.eclipse.org/bugs/show_bug.cgi?id=478923
[09:09:38] we have jetty 9.3.something afaics, maybe it could be useful
[09:10:21] reading your issue/link
[09:13:35] joal: from the task I read [num of brokers] * broker:druid.processing.numThreads > historical:druid.server.http.numThreads
[09:13:53] not smaller no?
[09:14:00] I wonder
[09:21:01] so in theory I can see how the above could lead to "locking", namely brokers waiting on historicals
[09:21:24] also say that historicals are busy for a bit deleting segments
[09:21:45] it makes some sense.. but it should only be temporary
[09:21:48] and self recover
[09:21:56] in this case, brokers lock up
[09:28:22] on druid1006 we have 296 threads for the broker and 291 for the historical
[09:30:36] GC timings wise, it is interesting to see that with some load the historicals have a big constant new gen collection rate https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&panelId=8&fullscreen&orgId=1&from=now-30d&to=now
[09:30:41] that goes up to a second
[09:33:39] joal: very interesting - I can see the same stacktrace on druid1006's broker AFTER the restart
[09:33:58] but it is serving traffic now
[09:43:02] druid.broker.http.readTimeout: PT5M looks suspicious
[09:43:33] "The timeout for data reads from Historical servers and real-time tasks."
[09:47:18] but yeah the main issue seems to be that connections are getting stuck
[09:47:28] mmmmm
[09:55:48] druid.server.http.queueSize on historicals is also by default unbounded
[09:56:48] "Size of the worker queue used by Jetty server to temporarily store incoming client connections. If this value is set and a request is rejected by jetty because queue is full then client would observe request failure with TCP connection being closed immediately with a completely empty response from server."
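A rough illustration of the sizing rule debated at [09:04:58] and [09:13:35], with hypothetical numbers standing in for the real settings: if the brokers can collectively hold more concurrent sub-queries open than each historical has HTTP threads, a temporarily slow historical (e.g. while dropping segments) lets requests pile up in its unbounded Jetty queue while the brokers wait out the full read timeout:

    # Hypothetical values; substitute the real runtime.properties settings.
    num_brokers = 3
    broker_processing_threads = 10   # druid.processing.numThreads on each broker
    historical_http_threads = 20     # druid.server.http.numThreads on each historical
    broker_read_timeout_s = 5 * 60   # druid.broker.http.readTimeout (PT5M)

    concurrent_subqueries = num_brokers * broker_processing_threads
    if concurrent_subqueries > historical_http_threads:
        # The excess sits in the historical's Jetty queue (unbounded by default),
        # while each broker keeps its connection open for up to the read timeout.
        print(f"{concurrent_subqueries} potential concurrent sub-queries vs "
              f"{historical_http_threads} historical HTTP threads: the backlog can "
              f"persist for up to {broker_read_timeout_s}s per stuck query.")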
[09:58:17] so the theory that I have in mind is now the following:
[09:58:31] 1) good level of traffic flowing
[09:58:52] 2) delete action issued to historicals, that slow down a bit due to deleting segments
[09:59:16] 3) brokers keep sending queries, and waiting 5M before timing out
[09:59:57] 4) historicals keep accumulating queries in their queue
[10:03:33] hmmm the git messages are not appearing in IRC for some reason
[10:04:40] fdans: you mean gerrit notifications?
[10:04:52] yep sorry
[10:04:57] * fdans doesn't have the best words
[10:05:30] wikibugs is on our channel though
[10:06:02] what repo are you referring to?
[10:06:45] joal: I think that the first step anyway is to deploy your change to limit the queries time windows
[10:07:09] so we can see how the GC behaves after it
[10:07:43] this is probably the first time that we see druid under "real" load
[10:08:12] I am also proposing to have some meaningful value for druid.server.http.queueSize
[10:08:16] on the historicals
[10:08:25] unbounded is a recipe for failure in my opinion
[10:11:35] elukey: not a big deal at all, but wikistats 2 doesn't seem to be triggering notifications
[10:12:00] fdans: was it generating them in the past or is it something that happened today?
[10:12:07] (trying to understand)
[10:12:15] today is the first time I'm not seeing them
[10:12:51] just saying it because I'm used to getting a notification whenever I push in wikistats2, but I'm not bothered by it personally
[10:24:50] fdans: sure sure, good to know since probably all the gerrit notifications are now broken :(
[10:24:58] I don't recall exactly where wikibugs lives
[10:26:10] a-team: need to take a longer lunch break today since I also have a doctor appointment, but I'll be of course reachable via phone/messaging/etc..
[10:26:44] about AQS: I have some ideas about configs to tweak, but first I'd like to see Joseph's patch out
[10:26:55] (we could pack it with the nodejs update too)
[10:38:31] * elukey afk lunch + doctor
[11:18:10] elukey: please excuse me for not answering, I had to look after Naé for a while
[11:18:27] I agree with the analysis elukey, let's deploy aqs when you're back :)
[13:29:16] Analytics, Operations: cergen: exceptions trying to add alt_name - https://phabricator.wikimedia.org/T220591 (Ottomata)
[13:38:56] joal: o/
[13:39:03] sorry I forgot about kids day
[13:39:20] o/
[13:39:23] let's deploy when you are free this evening!
[13:39:26] ottomata: o
[13:39:27] o/
[13:39:28] :)
[13:39:37] elukey: kids are on holiday, so no kids today (except when they're sick, my mother is home caring for them ;)
[13:39:50] ah!!!
[13:39:59] elukey: I'm soooooooo sad about the druid thing :(
[13:40:00] then anytime you want :)
[13:40:13] Ok let's move
[13:42:00] Ah! elukey - Should be Marcel doing it (we said that yesterday during standup)
[13:43:54] joal: true, I forgot about that. I'd prefer to start now though so we can also follow up with the rest of the configs :(
[13:44:14] I suspect this issue will take a long time to be fixed completely
[13:44:28] ok elukey - moving :)
[13:48:01] (PS1) Joal: Update aqs to ccd65c1 [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/502795
[13:54:52] (CR) Elukey: [C: +1] "src ccd65c11bac2630969df45df10e62c5232e6d219 looks good!"
[analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/502795 (owner: Joal)
[13:55:16] ok let's go then :)
[13:55:34] (CR) Joal: [V: +2 C: +2] "Merging for deploy" [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/502795 (owner: Joal)
[13:57:22] joal: nodejs already upgraded as well, so we can proceed
[13:57:35] awesome
[13:58:43] !log Deploying AQS
[13:58:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:59:24] I am checking HTTP requests on 1004 with httpry
[14:01:26] joal: done with canary or still in progress?
[14:03:20] elukey: I was testing the new feature on aqs1004
[14:03:24] working :)
[14:03:38] elukey: I think the beloved bot will hate us for that but he
[14:03:49] proceeding with the rest
[14:04:01] ack!
[14:04:06] from httpry all good
[14:04:19] elukey: my tests were positive as well
[14:04:47] hey team
[14:05:18] Hi mforns - Sorry we went ahead and deployed aqs (ops order - I didn't dare oppose ;)
[14:05:43] ahaha not an order :D
[14:05:48] ;)
[14:06:13] mforns: we have discovered some stuff to do after this deploy, I wanted to start earlier :( (also my bad that I had to run an errand)
[14:12:44] elukey, no problem of course
[14:13:25] actually, I'm not feeling very well today, and my daughter is in bed with a fever...
[14:13:32] so you did me a favor :]
[14:13:48] mforns: please rest :)
[14:14:15] elukey: deploy successfully finished, and as expected, some 4XX on the edits-per-page endpoint
[14:14:32] super
[14:15:25] query count decreased a lot
[14:15:31] and so did GC timings
[14:15:51] \o/ :)
[14:16:28] I'm also super happy because we actually didn't reduce service on important endpoints (the ones for wikistats2)
[14:16:42] joal: great job :)
[14:17:12] response-time pX is also dropping - goooood :)
[14:21:47] (PS3) Fdans: Replace time range selector [analytics/wikistats2] - https://gerrit.wikimedia.org/r/499968 (https://phabricator.wikimedia.org/T219112)
[14:21:54] (CR) jerkins-bot: [V: -1] Replace time range selector [analytics/wikistats2] - https://gerrit.wikimedia.org/r/499968 (https://phabricator.wikimedia.org/T219112) (owner: Fdans)
[14:23:35] Analytics, Operations, Patch-For-Review: cergen: exceptions trying to add alt_name - https://phabricator.wikimedia.org/T220591 (Ottomata) Ah no, checked the code, and this is indeed because Puppet CA has already signed a cert for this common name.
[14:23:46] Analytics, Operations, Patch-For-Review: cergen: exceptions trying to add alt_name - https://phabricator.wikimedia.org/T220591 (Ottomata) Added a note on https://wikitech.wikimedia.org/wiki/Cergen
[14:26:22] joal: I am wondering how latencies will look once the bot is reconfigured
[14:28:08] also I am reading http://druid.io/docs/0.12.3/configuration/index.html#historical
[14:28:15] druid.server.http.numThreads - Number of threads for HTTP requests - default: max(10, (Number of cores * 17) / 16 + 2) + 30
[14:28:16] (PS2) Milimetric: Show "Loading metric details..."
on async Detail [analytics/wikistats2] - https://gerrit.wikimedia.org/r/502623 (https://phabricator.wikimedia.org/T220088)
[14:28:47] we have druid.server.http.numThreads: 20
[14:28:50] mforns or fdans: when you have some time, two simple reviews in your gerrit queue
[14:29:42] milimetric: I'm going through them right now :)
[14:29:51] so nproc is 40 (I think we have hyperthreading), so (40 * 17) / 16 + 2 is more than 40
[14:29:59] milimetric: the refactor CR should be ready to merge
[14:30:11] and then we need to add 30
[14:30:43] fdans: oh I thought I merged it, one sec
[14:30:58] yeah, merged, a couple of hours ago
[14:31:09] I'm looking at the time selector one now
[14:31:17] elukey: I think the number of threads for historical-HTTP was too low given the number of potential demands from brokers
[14:31:29] elukey: let's try that?
[14:31:59] elukey: also - I have not deleted data for the datasource I disabled - So we can easily reload it, and disable it again once loaded
[14:33:10] milimetric: ooo sorry, notifications are not working great on gerrit
[14:33:12] joal: I need to investigate the current number of threads on every druid node, because it is already a lot and I am not sure why given these numbers
[14:33:29] anyway, I'd also propose to lower the broker's read timeout (to historicals) from 5M to 30S
[14:33:38] elukey: probably because MOAR :)
[14:33:45] I turned off all my gerrit notifications and just check the homepage every day, found it more reliable and easier since I'm there anyway
[14:33:47] and also have a fixed number of requests queued on the historicals, like 500
[14:34:06] elukey: this last one doesn't yet make sense to me
[14:34:13] elukey: +1 for the timeout
[14:34:21] fail fast
[14:34:52] joal: currently the historicals have an infinite backlog, so possibly the timeout could be enough now that I think about it
[14:35:15] joal: proposal - what do you think about retrying now the same thing that you did this morning (if possible) to see if the brokers lock up?
[14:35:26] in this way we'd know if it is traffic related or not
[14:35:29] for sure
[14:35:39] elukey: I'd be interested to understand how historicals handle the http backlog
[14:36:00] elukey: ok - reloading data (might trigger an alarm because of missing segments)
[14:36:26] "Size of the worker queue used by Jetty server to temporarily store incoming client connections. If this value is set and a request is rejected by jetty because queue is full then client would observe request failure with TCP connection being closed immediately with a completely empty response from server."
[14:37:46] hm
[14:46:16] at some point I guess that TCP limits for backlog are enforced
[14:46:39] but with 5m of timeout and (potentially) infinite queueing, stacking up is pretty easy
[14:47:01] on druid1004 I can see ~294 threads for the historical
[14:48:01] !log restart turnilo to pick up the new nodejs runtime
[14:48:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:49:46] ah also
[14:49:47] druid.server.http.enableRequestLimit
[14:49:51] If enabled, no requests would be queued in jetty queue and "HTTP 429 Too Many Requests" error response would be sent.
[14:50:23] and druid.server.http.defaultQueryTimeout
[14:50:28] Query timeout in millis, beyond which unfinished queries will be cancelled
[14:50:32] that is by default 5m
[14:53:07] (CR) Fdans: [V: +2 C: +2] "Tested, looks good to me!"
[analytics/wikistats2] - https://gerrit.wikimedia.org/r/502599 (https://phabricator.wikimedia.org/T220083) (owner: Milimetric)
[14:55:18] fdans: now it seems to be working, right? --^
[14:55:21] fdans: could you rebase your time selector so I'm safe reviewing it?
[14:55:50] milimetric: I'm on it
[14:55:59] elukey: yessir!
[15:25:14] (PS4) Fdans: Replace time range selector [analytics/wikistats2] - https://gerrit.wikimedia.org/r/499968 (https://phabricator.wikimedia.org/T219112)
[15:25:21] (CR) jerkins-bot: [V: -1] Replace time range selector [analytics/wikistats2] - https://gerrit.wikimedia.org/r/499968 (https://phabricator.wikimedia.org/T219112) (owner: Fdans)
[15:26:07] ok now I am confused
[15:26:10] druid.broker.http.numConnections=20
[15:26:23] "service":"druid/broker","host":"druid1005.eqiad.wmnet:8082","version":"0.12.3","metric":"jetty/numOpenConnections","value":35}]
[15:26:35] does it mean that we are queueing?
[15:27:10] but if I do
[15:27:10] elukey@druid1005:~$ ps -eLf | grep broker | wc -l
[15:27:10] 347
[15:27:18] not all are HTTP threads of course
[15:27:30] but how is it possible that we have so many running?
[15:29:03] ah probably those are processing threads
[15:30:02] but we have druid.processing.numThreads=10
[15:30:05] * elukey confused
[15:31:37] fdans: I think there's something weird going on with https://gerrit.wikimedia.org/r/#/c/analytics/wikistats2/+/499968/, it's based on an older patchset of the graph data refactor, instead of on master
[15:31:49] milimetric: yeayea I'm doing that now
[15:32:03] oh ok sorry
[15:32:25] nono :)
[15:49:39] (PS5) Fdans: Replace time range selector [analytics/wikistats2] - https://gerrit.wikimedia.org/r/499968 (https://phabricator.wikimedia.org/T219112)
[15:51:24] elukey: segment loading finished
[15:51:30] elukey: shall I disable the datasource?
[15:52:17] (CR) jerkins-bot: [V: -1] Replace time range selector [analytics/wikistats2] - https://gerrit.wikimedia.org/r/499968 (https://phabricator.wikimedia.org/T219112) (owner: Fdans)
[15:56:14] joal: let's do it!
[15:56:38] elukey: NOW
[15:56:54] Donw
[15:58:04] down?
[15:58:04] Analytics, Analytics-Kanban, Operations, Patch-For-Review: cergen: exceptions trying to add alt_name - https://phabricator.wikimedia.org/T220591 (Ottomata)
[15:58:13] done! sorry \l9
[15:58:16] Analytics, Analytics-Kanban, Operations, Patch-For-Review: cergen: exceptions trying to add alt_name - https://phabricator.wikimedia.org/T220591 (Ottomata) Built 0.2.4 deb and installed on puppetmaster1001.
[15:58:16] :)
[15:58:23] ahahahaha
[15:58:24] okok
[15:58:34] I checked logs and nothing shows up
[15:58:36] Analytics, Analytics-Kanban, Operations, Patch-For-Review: cergen: exceptions trying to add alt_name - https://phabricator.wikimedia.org/T220591 (Ottomata)
[15:58:47] so it is traffic
[15:58:47] elukey: all segments deleted from druid1004 - and no issue occurred on broker
[15:58:59] traffic + disabling
[15:59:04] hm
[15:59:17] elukey: we can relatively easily fake traffic
[15:59:18] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:59:22] Ahhhh :)
[15:59:22] aaahhhhh!
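Coming back to the thread-count puzzle from [14:28:15]-[15:30:02]: a quick back-of-the-envelope check of the documented default for druid.server.http.numThreads against the configured value, assuming the 40 cores mentioned at [14:29:51]:

    # Documented default in Druid 0.12: max(10, (cores * 17) / 16 + 2) + 30
    cores = 40  # nproc on the druid100x hosts, assuming hyperthreading (see [14:29:51])
    default_http_threads = max(10, cores * 17 // 16 + 2) + 30
    configured_http_threads = 20  # current druid.server.http.numThreads

    # 74 vs 20: the documented default would be well above what is configured.
    # Note that the ~300 threads seen with ps -eLf also include processing, GC,
    # ZooKeeper and HTTP client threads, among others, so they are not directly
    # comparable to this setting.
    print(default_http_threads, configured_http_threads)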
[15:59:28] :)
[15:59:32] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:59:39] only one?
[15:59:49] elukey: ganeti is in codfw too, yes?
[16:00:14] I just need to create vms from ganeti2001 there?
[16:00:24] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[16:00:32] ottomata: it is, yes
[16:00:37] coo
[16:00:38] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[16:00:40] sigh
[16:00:59] restarting the brokers
[16:01:09] elukey: can you wait for a minute please
[16:01:18] you can go sorry
[16:01:25] Analytics, Analytics-Kanban, EventBus, Patch-For-Review, Services (watching): EventGate should be able to configure hasty and guaranteed kafka producers individually - https://phabricator.wikimedia.org/T219032 (Ottomata)
[16:01:29] ah sorry joal
[16:01:34] Analytics, Analytics-Kanban, EventBus, Patch-For-Review, Services (watching): EventGate should extract event time from events and produce to kafka with timestamp - https://phabricator.wikimedia.org/T219513 (Ottomata)
[16:01:36] Analytics, Analytics-Kanban, EventBus, Patch-For-Review: EventGate Helm chart should POST test event for readinessProbe - https://phabricator.wikimedia.org/T218680 (Ottomata)
[16:01:39] just done
[16:01:40] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[16:02:34] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[16:02:37] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[16:02:42] elukey: the error is super weird
[16:02:50] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[16:04:28] joal: I am wondering if it is a jetty bug
[16:04:44] elukey: good thing is, we can replicate
[16:04:51] elukey: or at least we think we can
[16:06:06] it is not that uber bot
[16:06:12] nope
[16:06:34] and I don't think it is on the historicals since restarting the brokers fixes the problem
[16:08:42] I have a feeling that the historicals might be related - brokers time out on historicals
[16:16:28] Analytics, Analytics-Kanban, Core Platform Team (Modern Event Platform (TEC2)), Core Platform Team Backlog (Watching / External), and 2 others: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (Ottomata) a:Ottomata→elukey
[16:52:17] elukey: I assume historical http threads are related to druid-broker-client http threads
[16:54:31] joal: ?
[16:55:41] elukey: broker uses an http client to talk to historical, with a default number of conns of 20
[16:56:09] what is the setting?
[16:56:43] elukey: druid.broker.http.numConnections
[16:57:29] elukey: druid.server.... is the http-server (answering users when broker, or answering broker when historical)
[16:57:54] broker is bot a server (for users) and a client (for historical)
[16:57:58] +h
[16:57:59] both
[17:00:29] ah interesting druid.server
[17:00:30] okok
[17:00:34] I missed the naming
[17:00:58] np :)
[17:01:51] my understanding though is that druid.server.http.numThreads is used on both historical and broker, but those are separate pools
[17:02:16] elukey: same setting, different files
[17:02:29] yes we use it in various runtime properties
[17:02:33] yup
[17:02:38] I am not getting your point though
[17:03:00] elukey: broker uses an http client to talk to historical, with a default number of conns of 20
[17:03:04] this started from here
[17:03:39] so this setting seems not to be picked up by the broker, if I remove it from runtime.properties I obtain the same number of threads
[17:03:42] :(
[17:03:58] so, talking about http-servers (druid.server.http.numThreads) -- broker uses this one to answer users' queries, and historical uses this one to answer brokers' queries
[17:05:25] and they also have druid.server.http.queueSize
[17:06:19] yes - this is related to those components serving requests
[17:07:04] the broker however, has a particular role, since from a request (served using the server), it generates requests to historicals (through an http client)
[17:07:37] yep
[17:07:52] And this broker-http-client is configured through druid.broker.http.numConnections
[17:08:30] yep
[17:08:47] there is also the read timeout that I wanted to reduce
[17:08:47] if brokers globally have more capacity to query than historicals have to answer, they could be overwhelmed I guess
[17:09:09] especially with long timeouts
[17:10:10] Testing this will be complicated :(
[17:10:30] elukey: testing session tomorrow morning?
[17:11:02] sure
[17:11:04] I started https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/502814/2/hieradata/role/common/druid/public/worker.yaml
[17:11:07] with some thoughts
[17:12:00] elukey: I actually think our druid architecture is wrong: sharing broker and historical seems incorrect - Let's talk tomorrow :)
[17:12:39] (i know, there is no good answer to that given we have 3 machines)
[17:12:47] Gone for dinner team, back after
[18:08:18] * elukey off!
[18:20:54] Analytics, Analytics-EventLogging, QuickSurveys: QuickSurveys EventLogging missing ~10% of interactions - https://phabricator.wikimedia.org/T220627 (Jdlrobson)
[18:33:49] Analytics, Analytics-EventLogging, QuickSurveys: QuickSurveys EventLogging missing ~10% of interactions - https://phabricator.wikimedia.org/T220627 (Isaac) > Is it possible that the link to the survey is being shared outside a QuickSurvey (e.g. social media)? @Jdlrobson Good point but I'm pretty sur...
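To summarize the client/server distinction worked out above ([16:55:41]-[17:08:47]) together with the tweaks floated earlier ([14:33:29]-[14:33:47]), a hedged sketch of where each setting applies; the concrete numbers are the ones proposed in the conversation, not tested recommendations:

    # Broker: serves user queries (server side) and fans out to historicals (client side).
    broker_settings = {
        "druid.server.http.numThreads": 20,        # Jetty threads answering user queries
        "druid.broker.http.numConnections": 20,    # HTTP client connections towards historicals
        "druid.broker.http.readTimeout": "PT30S",  # proposed: down from PT5M, fail fast
    }

    # Historical: only a server here, answering the brokers' sub-queries.
    historical_settings = {
        "druid.server.http.numThreads": 20,  # currently 20; the documented default would be ~74 on these hosts
        "druid.server.http.queueSize": 500,  # proposed: bound the Jetty queue instead of the unbounded default
    }

The actual values under review live in the hieradata worker.yaml patch linked at [17:11:04].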
[19:01:34] (PS1) Joal: Refactor refinery/python/druid.py [analytics/refinery] - https://gerrit.wikimedia.org/r/502858 (https://phabricator.wikimedia.org/T220111)
[19:15:47] (PS2) Joal: [WIP] Refactor refinery/python/druid.py [analytics/refinery] - https://gerrit.wikimedia.org/r/502858 (https://phabricator.wikimedia.org/T220111)
[19:38:34] Analytics, EventBus, Operations, vm-requests, and 4 others: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (Ottomata)
[19:52:36] Analytics, EventBus, Operations, vm-requests, and 4 others: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (Ottomata) On ganeti1003: ` sudo gnt-instance add -t drbd -I hail --net 0:link=private --hypervisor-parameters=kvm:boot_order=network -o...
[20:06:39] Analytics, Analytics-Kanban, Wikimedia-Incident: attempt to backfill eventlogging data from eventlogging-client-side topic into per schema topics - https://phabricator.wikimedia.org/T220421 (Nuria) ahem, yes, at "glacial" speed as you can see
[20:58:20] Analytics, EventBus, Services (watching): EventGate service runner worker occasionally killed, usually during higher load - https://phabricator.wikimedia.org/T220661 (Ottomata)
[20:59:08] Analytics, EventBus, Services (watching): EventGate service runner worker occasionally killed, usually during higher load - https://phabricator.wikimedia.org/T220661 (Ottomata) Here's an strace of the staging worker just before it is killed: https://gist.github.com/ottomata/ce1b3afc6213b35678b65b6151...
[21:14:09] Analytics, EventBus, Services (watching): EventGate service runner worker occasionally killed, usually during higher load - https://phabricator.wikimedia.org/T220661 (Ottomata) There is a 5 second delay between these two syscalls: ` 21:05:00.233006 sendmsg(12, {msg_name={sa_family=AF_INET, sin_port=...
[21:33:58] Analytics, EventBus, Services (watching): EventGate service runner worker occasionally killed, usually during higher load - https://phabricator.wikimedia.org/T220661 (Pchelolo) These 2 heartbeats are more than 10s apart: ` 21:04:48.628548 write(3, "{\"type\":\"heartbeat\"}\n", 21) = 21 21:04:59.3021...
[21:44:40] Analytics, EventBus, Services (watching): EventGate service runner worker occasionally killed, usually during higher load - https://phabricator.wikimedia.org/T220661 (Pchelolo) Hm, the `sendmsg` that happened before a 5-second delay was for port 9125, which is statsd_exporter.
[23:58:18] Analytics, EventBus, Beta-Cluster-reproducible, User-DannyS712: Beta cluster: MassMessage fails with PHP fatal error - https://phabricator.wikimedia.org/T220662 (Legoktm) p:Triage→Unbreak! I haven't verified myself, but if this is an issue, it's UBN.
[23:58:29] Analytics, EventBus, MediaWiki-JobQueue, WMF-JobQueue, and 2 others: Beta cluster: MassMessage fails with PHP fatal error - https://phabricator.wikimedia.org/T220662 (Legoktm)
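For the kind of analysis going on in T220661 above (spotting multi-second gaps between syscalls or heartbeats in an strace capture), a small illustrative helper; the timestamp format is assumed to match the absolute-time style visible in the pasted snippets, and the file name is hypothetical:

    import re
    from datetime import datetime

    TS_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6})\s")

    def find_gaps(lines, threshold_s=5.0):
        """Yield (gap_seconds, previous_line, current_line) for gaps above the threshold."""
        prev_ts, prev_line = None, None
        for line in lines:
            m = TS_RE.match(line)
            if not m:
                continue
            ts = datetime.strptime(m.group(1), "%H:%M:%S.%f")
            if prev_ts is not None and (ts - prev_ts).total_seconds() > threshold_s:
                yield (ts - prev_ts).total_seconds(), prev_line, line
            prev_ts, prev_line = ts, line

    # Hypothetical usage against a saved strace capture:
    # with open("eventgate-worker.strace") as f:
    #     for gap, before, after in find_gaps(f):
    #         print(f"{gap:.3f}s gap:\n  {before.rstrip()}\n  {after.rstrip()}")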