[02:10:37] 10Analytics, 10Analytics-SWAP: Upgrade R in SWAP notebooks to 3.4+ - https://phabricator.wikimedia.org/T222933 (10Groceryheist)
[04:19:27] PROBLEM - Check the last execution of monitor_refine_eventlogging_analytics on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_analytics
[05:32:32] morning!
[05:32:36] just restarted eventlogging
[05:32:43] there were some consume errors
[05:32:50] and afaics there is some lag
[05:37:28] mmmm something is not working ok
[05:40:11] https://grafana.wikimedia.org/d/000000027/kafka?panelId=54&fullscreen&orgId=1&from=now-30d&to=now
[05:40:23] as a precaution, I have executed preferred-replica-election on jumbo
[05:40:29] 1002 seems used way more than the others
[05:45:17] I am seeing kafka.coordinator [WARNING] Heartbeat failed for group eventlogging_processor_client_side_00 because it is rebalancing
[05:45:27] for example often in the eventlogging processors' logs
[05:46:54] and camus complains about
[05:46:54] Topic not fully pulled, max task time reached at 2019-05-09
[05:46:54] T22:06:04.000Z, pulled 3872 records
[05:46:55] T22: Identify features Bugzilla users would miss in Phabricator - https://phabricator.wikimedia.org/T22
[05:46:58] for example
[05:50:06] kafka bytes out is super spiky for 1002
[05:50:16] but that has been going on for a while
[05:50:28] there must be a client that pulls data in a weird/bursty way
[05:59:08] so the first occurrence of "Topic not fully pulled, max task time reached at 2019-05-09T21:48:34.000Z, pulled 2 records"
[05:59:23] was at May 9 22:07:05 UTC, for camus eventlogging
[06:06:59] and eventlogging processors started to fail the consumer group heartbeats (with the assigned broker leader) intermittently since at least May 8
[06:07:45] there is a clear trend of kafka clients lagging a bit, but I still haven't found a clear cut with the last alarms
[06:12:43] ah we also didn't refine the last hour that errored for upload yesterday
[06:12:46] sigh
[06:30:14] !log refine with higher loss threshold webrequest upload 2019-5-8-18
[06:30:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:10:26] ok so all false positives after refinement
[07:10:33] one thing fixed :D
[07:14:53] trying now to run an el processor with max_poll_records=100 (default 500)
[07:15:24] since it fails the heartbeat with kafka, it should mean that the poll() spends too much time processing?
[07:19:58] Morning elukey - From the alert emails, I guess it's not been a quiet past 2 days :S
[07:20:38] joal: bonjour! Now it is you and Andrew with a curse! :D
[07:20:42] Andrew's one is worse
[07:20:52] kidding :)
[07:20:56] :)
[07:20:59] small things, nothing horrible
[07:21:20] but! One positive thing is that the RPC alarms + hdfs audit logs work well
[07:21:24] Is there anything I should start focusing on, or is reading emails a good start?
[07:21:31] \o/
[07:22:10] yesterday we had another spark partitioning issue causing a ton of temp files
[07:22:18] but we caught it after 5 mins
[07:22:28] nice !!
[07:22:34] still not a good fence but better than seeing the HDFS master going down :D
[07:22:49] also, the hdfs-audit.log on the master showed user + RPC action
[07:22:59] that was.. create file tmp etc.
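For reference, the max_poll_records experiment mentioned above (07:14) maps onto kafka-python's KafkaConsumer settings roughly as sketched below. The group id is the one from the heartbeat warning; the topic name, broker host and timeout values are illustrative placeholders, not the actual eventlogging processor configuration.

```python
from kafka import KafkaConsumer

# Illustrative sketch only -- not the real eventlogging processor setup.
consumer = KafkaConsumer(
    'eventlogging-client-side',                              # placeholder topic name
    bootstrap_servers=['kafka-jumbo1002.eqiad.wmnet:9092'],  # placeholder broker
    group_id='eventlogging_processor_client_side_00',
    # Fewer records per poll() means less processing time between polls,
    # which is what dropping max_poll_records from 500 to 100 is testing.
    max_poll_records=100,
    # If the group coordinator sees no heartbeat within session_timeout_ms,
    # the member is evicted and the group rebalances.
    session_timeout_ms=30000,
    heartbeat_interval_ms=3000,
)

for message in consumer:
    # Stand-in for the real per-event processing done by the processor.
    print(message.topic, message.partition, message.offset)
```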
[07:23:03] :)
[07:23:24] so please read emails, the only outstanding problem seems to be related to eventlogging
[07:23:33] but not really sure what now :)
[07:23:38] ok - reading
[07:23:43] Thanks for the heads up
[07:25:31] check webrequest status seems ok now, should recover in the next few minutes
[07:35:07] RECOVERY - Check the last execution of check_webrequest_partitions on an-coord1001 is OK: OK: Status of the systemd unit check_webrequest_partitions
[07:40:41] going afk for a bit!
[07:49:17] (03CR) 10Joal: [V: 03+1] "Ping @fdans please :)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/502858 (https://phabricator.wikimedia.org/T220111) (owner: 10Joal)
[08:02:54] back!
[08:22:03] I am trying one setting for the kafka python consumers of Eventlogging
[08:22:12] namely setting max_poll_records to 100 (default is 500)
[08:22:18] to see if the rebalances decrease
[08:23:53] looks good so far
[08:25:26] a bit better, seeing fewer rebalances but they still happen
[08:44:40] another try made, raising timeouts
[08:46:39] nope
[08:46:46] I suspect though that the settings are not applied
[08:46:46] :(
[08:47:40] anyway joal, I think that we have a not-new issue with eventlogging, namely the processors failing to health-check with the broker (leader of their consumer group) and triggering rebalances
[08:48:01] and the last alerts from camus + eventlogging analytics refine
[08:51:50] elukey: could it be related to a version mismatch between kafka and consumers?
[08:52:26] in theory no, kafka python is 1.4.6 which is recent
[08:53:10] I am trying now one last change
[08:53:26] namely no specific timeouts (seems to get worse with these) and poll size 50
[08:53:37] that should be the max number of records to fetch for each poll
[08:55:37] joal: look how beautiful this graph is
[08:55:38] https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-30d&to=now&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=jumbo-eqiad&var-topic=All&var-consumer_group=eventlogging_consumer_client_side_events_log_00&panelId=1&fullscreen
[08:55:43] seems like the throne of GOT :D
[08:56:49] :)
[08:59:39] my current theory is that processing of some events became heavier for some reason
[08:59:54] hm
[09:00:17] so kafka-python does a poll() and then spends a lot of time on processing, causing a timeout to occur that triggers the rebalance
[09:00:26] the whole consumer group is affected
[09:00:33] but usually once it stabilizes it is good
[09:00:40] but I keep seeing constant churning
[09:00:52] so it means that processors are failing one by one periodically
[09:01:21] there are unbalanced topics elukey
[09:01:35] yes 1002 is serving more traffic
[09:01:53] I tried a preferred-replica-election today but it was not really effective
[09:01:58] elukey: https://grafana.wikimedia.org/d/000000027/kafka?panelId=12&fullscreen&orgId=1
[09:02:10] I'd say 1003 receives more traffic
[09:02:55] joal: https://grafana.wikimedia.org/d/000000027/kafka?panelId=54&fullscreen&orgId=1
[09:03:09] Nice :)
[09:03:21] 1002 gets more messages, but 1003 gets more volume!
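The theory above (09:00) is about the time spent between successive poll() calls. A rough sketch of a poll/process/commit loop in kafka-python is shown below, assuming the processors use the manual poll() API; the topic and broker are placeholders and the handler is purely hypothetical.

```python
from kafka import KafkaConsumer

def handle(value):
    # Stand-in for the real per-event work (parse, validate, enrich).
    pass

consumer = KafkaConsumer(
    'eventlogging-client-side',            # placeholder topic
    bootstrap_servers=['localhost:9092'],  # placeholder broker
    group_id='eventlogging_processor_client_side_00',
    enable_auto_commit=False,
    max_poll_records=50,                   # the "poll size 50" experiment above
)

while True:
    # poll() returns {TopicPartition: [ConsumerRecord, ...]}
    batch = consumer.poll(timeout_ms=1000)
    for tp, records in batch.items():
        for record in records:
            handle(record.value)
    # If the processing above routinely takes longer than the group allows
    # (max_poll_interval_ms in newer clients, the session timeout in older
    # ones), the coordinator assumes the member is gone and rebalances --
    # the churn described in the logs above.
    consumer.commit()
```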
[09:04:19] the interesting thing is that your graph is constant for the past 30d
[09:04:22] mine is not
[09:04:25] https://grafana.wikimedia.org/d/000000027/kafka?panelId=54&fullscreen&orgId=1&from=now-30d&to=now
[09:05:00] from the 23rd there is a clear increase
[09:05:06] that of course does not match with EL's lg
[09:05:08] *lag
[09:05:08] sigh
[09:17:44] 10Analytics, 10Analytics-Kanban: Eventlogging processors are frequently failing heartbeats causing consumer group rebalances - https://phabricator.wikimedia.org/T222941 (10elukey) p:05Triage→03High
[09:22:09] 10Analytics, 10Analytics-Kanban: Eventlogging processors are frequently failing heartbeats causing consumer group rebalances - https://phabricator.wikimedia.org/T222941 (10elukey) I tried with the following changes: https://gerrit.wikimedia.org/r/#/c/509341/ https://gerrit.wikimedia.org/r/#/c/operations/puppe...
[09:22:18] joal: added some thoughts --^
[09:22:33] my question is: are the recent refine failures due to --^ or something else?
[09:22:53] And actually elukey, bytes-in is higher for both 1002 and 1004 when looking at a longer time range
[09:33:07] elukey: also I think there is a partition-replication imbalance in kafka
[09:33:19] This deviation https://grafana.wikimedia.org/d/000000027/kafka?panelId=12&fullscreen&orgId=1 doesn't make sense
[09:39:02] elukey: something else - Spike in eventlogging_VirtualPageView schema between 21:20 and 23:35
[09:39:54] elukey: same spike in eventlogging_CitationUsagePageLoad
[09:40:11] elukey: this feels like nuria backfilling eventlogging data, no?
[09:42:22] hm - maybe not - seems that other eventlogging schemas don't show the pattern
[09:45:04] joal: I had the same thought but didn't find any trace of it.. we can check on the stat boxes?
[09:45:37] sorry, better question - what would be the current leading theory?
[09:45:53] something ongoing affecting the refinement, or something happened that messed it up?
[09:58:23] joal: it seems, from the kafka logs, that the broker leader of the cgroup is jumbo 1001
[09:58:44] I am wondering if it could make sense to restart kafka on it to see if the issue still persists on the next leader
[09:59:07] kinda desperate attempt I know
[10:00:27] joal: ah!!!!!
[10:00:34] https://tools.wmflabs.org/sal/analytics?p=0&q=&d=2019-04-30
[10:00:37] take a look!
[10:00:47] PERFECT MATCH
[10:00:53] that's it, 1.4.6 is at fault
[10:01:05] I didn't check in the analytics SAL earlier on
[10:01:12] only in the production one
[10:01:54] 10Analytics, 10Analytics-Kanban: Eventlogging processors are frequently failing heartbeats causing consumer group rebalances - https://phabricator.wikimedia.org/T222941 (10elukey) SAL for the 04-30: https://tools.wmflabs.org/sal/production?p=0&q=&d=2019-04-30 https://tools.wmflabs.org/sal/analytics?d=2019-04-...
[10:11:13] so we surely need to either find what's wrong or roll back to 1.4.3
[10:11:20] (that needs to be rebuilt)
[10:13:39] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Operations, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (10elukey) The upgrade of python-kafka to 1.4.6 on eventlog1002 coincides very well with T222941 :(
[10:15:31] IIRC andrew should be working today
[10:17:21] https://github.com/dpkp/kafka-python/blob/master/CHANGES.md#145-mar-14-2019
[10:17:25] This release is primarily focused on addressing lock contention and other coordination issues between the KafkaConsumer and the background heartbeat thread that was introduced in the 1.4 release.
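One way to double-check what the consumer-lag dashboards above show, independent of Grafana, is to compare each partition's committed offset for the group against the log end offset. A sketch with kafka-python follows; the consumer group name is the one from the dashboard URL, while the topic and broker are placeholders. The consumer below never subscribes or polls, so it does not join the group or trigger a rebalance itself.

```python
from kafka import KafkaConsumer, TopicPartition

TOPIC = 'eventlogging-client-side'   # placeholder topic
GROUP = 'eventlogging_consumer_client_side_events_log_00'

consumer = KafkaConsumer(
    bootstrap_servers=['localhost:9092'],   # placeholder broker
    group_id=GROUP,
    enable_auto_commit=False,
)

# partitions_for_topic() may do a metadata round trip on first use.
partitions = [TopicPartition(TOPIC, p)
              for p in consumer.partitions_for_topic(TOPIC)]
end_offsets = consumer.end_offsets(partitions)

for tp in partitions:
    committed = consumer.committed(tp) or 0
    print('partition %d lag: %d' % (tp.partition, end_offsets[tp] - committed))

consumer.close()
```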
[10:26:08] 10Analytics, 10Analytics-Kanban: Eventlogging processors are frequently failing heartbeats causing consumer group rebalances - https://phabricator.wikimedia.org/T222941 (10elukey) https://github.com/dpkp/kafka-python/issues/1418 seems related, some workarounds are listed.
[10:34:45] 10Analytics, 10Analytics-Kanban: Eventlogging processors are frequently failing heartbeats causing consumer group rebalances - https://phabricator.wikimedia.org/T222941 (10elukey) Rebuilt python-kafka_1.4.3-1_all.deb and uploaded to eventlog1002 in case we decide to rollback.
[10:55:56] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Operations, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (10elukey) >>! In T221848#5137318, @MoritzMuehlenhoff wrote: > We can probably simply backport https://github.com/dpkp/kafka-python/pull/1628/commits/f12d4...
[11:08:37] elukey: may be easier to chat here, looks like it has been failing in nagios for ~1.5 days
[11:09:35] jbond42: yep yep I was about to write, Marcel did some changes and IIUC yesterday they were supposed to be working fine
[11:09:38] apparently not
[11:10:25] I am seeing that the ExecStart contains &&, not sure if systemd parses those like bash
[11:10:36] I will replace it with a bash script
[11:10:59] ack feel free to ping me for review
[11:11:34] jbond42: just as a note, what command did you run?
[11:11:48] the saltrotate one?
[11:12:17] elukey: https://phabricator.wikimedia.org/P8509 command and output
[11:12:58] ah!
[11:13:00] jbond42: <3
[11:13:20] yeah pretty sure that systemd doesn't like it
[11:13:40] yes i think you're right
[11:17:25] going to send a patch after lunch!
[11:21:10] ack
[11:21:57] going afk for lunch!
[13:02:53] 10Analytics, 10Analytics-Kanban, 10Operations, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10fgiunchedi) >>! In T196066#5170415, @Ottomata wrote: > I think there are a few more branches: > > - prod...
[13:18:29] hey team!
[13:18:49] o/
[13:18:56] mforns: I have a code review for you
[13:19:08] elukey, ok
[13:19:16] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/509387/
[13:19:30] elukey, I just saw you merged the one for AQS deploy, thanks!
[13:19:30] when you are ready, not super urgent :)
[13:19:39] mforns: nope I didn't!
[13:20:15] elukey erb @ vs puppet $ :)
[13:21:18] ah right!
[13:21:19] fixing
[13:22:57] ottomata: done :)
[13:23:08] oh elukey, my mistake, was looking at another patch before
[13:23:50] mforns: we can merge the AQS one on monday if you are ok, since it'll require a roll restart of aqs
[13:23:50] elukey, the one I was talking about: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/509150/1/hieradata/role/common/aqs.yaml
[13:25:23] elukey, the patch you passed me looks good to me, though I don't know exactly how you pass "parameters" to the template. Is it implicit?
[13:26:32] ok elukey no prob
[13:27:48] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Operations, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (10Ottomata) That'd be fine!
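On the ExecStart question above (11:10): systemd does not run Exec lines through a shell, so `&&` is not interpreted the way bash would interpret it, which is why wrapping the commands in a bash script (or `/bin/bash -c '...'`) is the usual fix. The same pitfall can be reproduced in Python with subprocess, offered here only as a rough analogy:

```python
import subprocess

# Without a shell, "&&" is just another argument passed to echo -- roughly
# analogous to putting "&&" in a systemd ExecStart= line.
subprocess.run(['/bin/echo', 'first', '&&', '/bin/echo', 'second'])
# output: first && /bin/echo second

# Asking for a shell explicitly restores the expected chaining, analogous
# to moving the unit's command into a bash wrapper script.
subprocess.run(['/bin/bash', '-c', '/bin/echo first && /bin/echo second'])
# output: first
#         second
```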
[13:28:53] mforns: yes exactly
[13:29:04] the erb templates grab them from the scope in which the file resource is defined
[13:29:09] in this case, the variables of the class
[13:29:42] ok, then LGTM
[13:30:01] ottomata: when you have time I'd like to discuss with you what to do in the short term for eventlog1002
[13:30:55] mforns: also, was it ok to run the saltrotate today manually?
[13:31:21] elukey, yes, saltrotate did (or should have done) nothing today
[13:31:30] lemme check
[13:31:51] it didn't run yesterday though
[13:31:58] (EU evening I mean)
[13:32:12] but yeah, better to triple check :D
[13:32:54] 10Analytics, 10Analytics-Kanban, 10Operations, 10Traffic, 10Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10Ottomata) > I'm not sure I see the value in breaking up the broker name into broker_hostname, broker_id,...
[13:32:55] elukey, everything is good
[13:33:10] salt will only be rotated at the end of the quarter
[13:33:14] gooood
[13:33:37] mforns: so I can re-run it now to see that everything works?
[13:34:45] elukey: so likely those camus checker failure emails about those high volume EL topics were due to this consumer rebalance problem, yes?
[13:35:06] ottomata: not sure, it has been ongoing for a while, but it is the only weird thing that I found
[13:35:09] :(
[13:35:38] I tried some parameters today for kafka python but they were all making it worse
[13:35:46] or not changing much
[13:36:49] ok
[13:36:53] am reading this kafka-python bug
[13:39:14] elukey: you got that 1.4.3 .deb from buster?
[13:40:03] I say let's install it and see if it fixes the problem
[13:40:08] ottomata: nono I have simply used the gerrit repo, then did git reset --hard 1.4.3 on master and to your 1.4.3 release commit on debian
[13:40:11] then built
[13:40:20] oh
[13:40:38] but I wasn't sure if it was the correct source/procedure
[13:40:38] but buster has a backport for gilles' bug to 1.4.3?
[13:40:38] elukey, sure
[13:40:41] so didn't attempt
[13:41:04] elukey, can re-run, it should do nothing. Just print some logs
[13:41:05] OHHh
[13:41:05] sorry
[13:41:17] moritz was just saying that 1.4.3 is in buster
[13:42:31] ottomata: IIUC he proposed to open a bug with debian upstream if our package worked, no?
[13:43:10] yes, sorry
[13:43:12] misremembered
[13:43:13] just read
[13:43:15] better
[13:43:29] we could definitely grab the buster source, rebuild for stretch (even if no C parts are involved afaics), apply the patch and test
[13:43:31] the 1.4.3 that you put there is what we were running before
[13:43:42] nono, we have the package, we can do it too.
[13:43:48] hmm
[13:44:15] elukey: i'll try and backport this fix for gilles, make a .deb, then we can install on eventlog1002 and see if it fixes our problem
[13:44:29] if it does then we can upload to apt and go with this one?
[13:44:30] ottomata: 1.4.3 + the patch?
[13:44:33] yes
[13:44:36] +1
[13:44:42] k
[13:47:26] RECOVERY - Check the last execution of refinery-eventlogging-saltrotate on an-coord1001 is OK: OK: Status of the systemd unit refinery-eventlogging-saltrotate
[13:48:39] \o/
[13:48:47] oh elukey
[13:48:47] hm
[13:48:49] ?
[13:48:57] that PR that gilles wanted is already in 1.4.3
[13:49:14] OH nonono
[13:49:15] sorry
[13:49:19] i just branched wrong.
[13:58:35] ottomata: when you are ready to install gimme a ping so that I'll merge a change to remove the max_poll_records
[13:58:51] ok
[14:09:42] (03CR) 10Elukey: "> Given that it's not private data, analytics seems fine. Do we have" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/509016 (https://phabricator.wikimedia.org/T220971) (owner: 10Elukey)
[14:14:43] ok elukey am ready to install
[14:16:40] merging
[14:17:08] you can install, I'll merge + run puppet to refresh ok?
[14:17:17] ok
[14:17:44] !log downgrading python-kafka from 1.4.6-1~stretch1 to 1.4.3-2~wmf0 on eventlog1002 - T221848
[14:17:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:17:47] T221848: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848
[14:17:49] done elukey
[14:20:57] done!
[14:21:56] ok elukey now we wait?
[14:22:43] ottomata: still seeing rebalances in the processor logs
[14:23:02] but those might have happened even before, so yes let's wait a bit
[14:23:35] I am watching https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=jumbo-eqiad&var-topic=All&var-consumer_group=eventlogging_consumer_client_side_events_log_00
[14:23:48] oh ya
[14:24:00] a lot fewer rebalances for sure afaics
[14:24:02] (from the logs)
[14:24:26] ha maybe the deadlock fix we just backported is the cause of this :p
[14:25:30] we need eventgate! :P
[14:27:44] (03CR) 10Fdans: [C: 03+1] "I like nice code :) Thank you for doing this Joseph, this looks and works way nicer. I think the solution of identifying the leaders is aw" (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/502858 (https://phabricator.wikimedia.org/T220111) (owner: 10Joal)
[14:33:17] ottomata: it seems it's not working :(
[14:40:03] ok elukey let's roll back to 1.4.3 without the backport
[14:40:04] and see
[14:40:25] oh to
[14:40:26] 1.4.1
[14:40:28] not 1.4.3, right?
[14:40:39] 1.4.3 IIRC
[14:41:01] hmmm it's just not in /var/cache/apt/archives as i'd expect
[14:41:02] hm
[14:42:26] from your commits on the repo it seemed 1.4.3
[14:46:38] unless we never installed it on eventlog1002?
[14:47:04] let's try 1.4.3 without the patch first
[14:47:09] elukey: that's the one in your homedir ya?
[14:51:02] ottomata: if you trust what I did yes :D
[14:51:08] oh you rebuilt it?
[14:51:09] hmm
[14:51:36] it is based on your commits on the repo so should be good
[14:51:43] ok i'm going to take the one that is on boron just in case
[14:51:47] that was built last year
[14:51:53] ack!
[14:52:22] I hope I didn't overwrite it
[14:52:45] ok, restarting eventlogging
[14:53:10] !log restarted eventlogging with python-kafka-1.4.3
[14:53:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:57:45] (03PS18) 10Fdans: Replace time range selector [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/499968 (https://phabricator.wikimedia.org/T219112)
[14:57:56] this might be the good one
[14:58:00] elukey: i still see heartbeats failing...
[14:58:58] ottomata let's see if they crash brutally or if they are kind of stable(ish)
[14:59:17] it must be that package, the regression matches 1:1 with your upgrade
[15:00:08] i'd assume so too....
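Given the back-and-forth above about which python-kafka build actually ends up on eventlog1002, a quick sanity check is to ask the interpreter which version the processors would import. The guard below is hypothetical, not something eventlogging itself does:

```python
import kafka

print(kafka.__version__)   # e.g. '1.4.3' after the downgrade

# A start-up guard like this (illustrative only) would make a silent
# upgrade/downgrade mismatch show up immediately in the processor logs.
EXPECTED = '1.4.3'
if kafka.__version__ != EXPECTED:
    raise RuntimeError('kafka-python %s installed, expected %s'
                       % (kafka.__version__, EXPECTED))
```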
[15:00:12] but ya let's watch the lag
[15:03:00] milimetric: ready when you are to merge the timeselector change
[15:06:03] mforns: we are investigating an issue with eventlogging that may be related to the refine failures
[15:06:14] elukey, ah
[15:06:44] that is https://phabricator.wikimedia.org/T222941
[15:06:45] elukey, are you seeing the same error?
[15:06:58] but it doesn't correlate with the occurrence of the alarm, since it started before
[15:07:43] elukey, I sent a response to the alarm thread (monitor_refine_eventlogging)
[15:08:07] (03CR) 10Ladsgroup: [C: 03+1] "> Patch Set 1:" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/509016 (https://phabricator.wikimedia.org/T220971) (owner: 10Elukey)
[15:08:14] mforns: yep saw it
[15:08:31] mforns: but there were also errors related to missing data, no?
[15:08:48] I think it's a deprecated lib? maybe? I've seen that javax.mail has been unmaintained since 2013 and it's kept for compat, but we should be using javax.mail-api
[15:09:04] elukey, yes, the monitor was about to send a failure report
[15:09:33] which is probably related to the issue you're mentioning
[15:09:37] ok, I get that
[15:12:04] mforns: the main problem is that the issue started ~10 days ago
[15:12:12] not sure why we got those alerts only now
[15:12:16] O.o
[15:12:25] ottomata: it seems stable now!
[15:12:37] ok
[15:12:56] elukey: i'm still seeing lots of spiky lag
[15:12:57] OH
[15:12:58] sorry
[15:13:01] my graph isn't updated to the latest
[15:13:37] hm so far stable but it hasn't been stable that long, want to wait a bit more
[15:13:58] ah yes but I am sure we got it
[15:16:42] the heartbeats still flap tho!
[15:17:04] yeah but once in a while was the previous behavior
[15:17:13] if it doesn't lag we are good(ish)
[15:17:51] hmm, was it also missing heartbeats every minutish before?
[15:21:15] elukey: what is your github username?
[15:21:32] elukey
[15:21:57] https://github.com/dpkp/kafka-python/issues/1418#issuecomment-491327542
[15:25:08] super thanks
[15:27:01] not sure how to handle multiple versions of the package now
[15:27:10] we need to chat with gilles about it
[15:34:57] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Eventlogging processors are frequently failing heartbeats causing consumer group rebalances - https://phabricator.wikimedia.org/T222941 (10elukey) p:05High→03Normal Andrew deployed 1.4.3 and we are back to stable.
[15:39:23] ottomata: of course I write --^ and then lag occurs
[15:39:24] sigh
[15:39:43] let's see how it behaves over a period of some hours
[15:42:32] (but the impact is way lower now)
[15:52:49] (03CR) 10Milimetric: [V: 03+2 C: 03+2] "Good and good and good." (031 comment) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/499968 (https://phabricator.wikimedia.org/T219112) (owner: 10Fdans)
[15:53:45] !log kill mediacounts-archive coordinator, chown analytics:analytics /wmf/data/archive/mediacounts + restart the coord with the analytics user
[15:53:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:56:11] 10Analytics, 10Analytics-Kanban, 10EventBus, 10serviceops, 10Services (watching): Change LVS port for eventlogging-analytics from 31192 to 33192 - https://phabricator.wikimedia.org/T222962 (10Ottomata)
[15:58:11] 10Analytics, 10Analytics-Kanban, 10EventBus, 10serviceops, 10Services (watching): Change LVS port for eventlogging-analytics from 31192 to 33192 - https://phabricator.wikimedia.org/T222962 (10Ottomata) Hm, question.
Currently mediawiki-config ProductionServices.php has: 'eventgate-analytics' => 'http...
[15:59:52] ping ottomata
[16:11:46] RECOVERY - Check the last execution of monitor_refine_eventlogging_analytics on an-coord1001 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_analytics
[16:32:30] 10Analytics, 10Performance: > 2% of API wall time spent generating UUIDs - https://phabricator.wikimedia.org/T222966 (10ori)
[16:32:48] 10Analytics, 10Performance: > 2% of API wall time spent generating UUIDs - https://phabricator.wikimedia.org/T222966 (10ori)
[16:40:29] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Replace current time range selector on Wikistats to allow for arbitrary time selections - https://phabricator.wikimedia.org/T219112 (10Milimetric) Three issues found while testing this on staging on my iphone: https://wikistats-canary.wmflabs.org/time-sel...
[16:40:50] fdans: listed the issues I found with the selector ^. If you're sick of it I can debug, up to you
[16:41:00] curious if you see them on your phone too
[16:51:23] * elukey off!
[17:09:02] team, need to pick up my mom at the (boat?)port, will be back in a bit
[18:54:38] 10Analytics, 10Puppet: modules/udp2log/manifests/instance/monitoring.pp has unreachable code - https://phabricator.wikimedia.org/T152104 (10Dzahn)
[18:55:17] 10Analytics, 10Puppet: modules/udp2log/manifests/instance/monitoring.pp has unreachable code - https://phabricator.wikimedia.org/T152104 (10Dzahn) It's been a couple years and this has been called obsolete for a long time but also it's not completely removed yet.
[18:56:32] 10Analytics, 10Operations, 10Wikimedia-Logstash, 10observability: Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856 (10Dzahn)
[18:56:47] 10Analytics, 10Operations, 10Wikimedia-Logstash, 10observability: Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856 (10Dzahn) also T152104
[19:31:27] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Replace current time range selector on Wikistats to allow for arbitrary time selections - https://phabricator.wikimedia.org/T219112 (10fdans) > when navigating to another area (Content/Contributing/Reading) from the Detail view, the time range control shri...
[23:34:26] THE TIMERANGE IS SO COOL fdans !!!!
[23:43:34] Thank youuu don't use the right handle yet tho, I already have a fix for it