[05:50:17] !log re-run webrequest-load-wf-misc-2018-5-27-22, webrequest-load-wf-text-2018-5-28-2, webrequest-load-wf-upload-2018-5-28-3
[05:50:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:28:34] joal: o/
[06:28:48] so I'd like to start my Monday with reimaging druid1003 to Stretch :)
[06:34:03] for the moment I just stopped everything on the node, and set druid1002 in superset
[06:56:52] works for me elukey :)
[06:57:41] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Upgrade Druid nodes (1001->1006) to Debian Stretch - https://phabricator.wikimedia.org/T192636#4145484 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['druid1003.eqiad.wmnet']...
[06:58:05] joal: just started!
[07:12:06] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Upgrade Druid nodes (1001->1006) to Debian Stretch - https://phabricator.wikimedia.org/T192636#4235479 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['druid1003.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['druid1003.eq...
[07:32:42] joal: druid1003 back in the cluster with stretch :)
[07:35:38] (CR) Elukey: [C: 1] "LGTM" [analytics/refinery] - https://gerrit.wikimedia.org/r/435169 (owner: Joal)
[07:39:45] the only weird thing is that zookeeper didn't pick up the jvm opts
[07:39:46] mmmm
[07:43:54] ah no, the daemon started before the new config was applied
[07:43:56] all good now
[07:46:31] I can see the coordinator leader (druid1001) telling druid1003 to load segments
[07:47:37] joal: if you are ok I'll do one druid node per day
[07:48:03] maybe even two, seems feasible
[07:48:19] so by the end of the week we'll have only stretch
[07:48:42] (afk for a bit!)
[08:02:05] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Upgrade Druid nodes (1001->1006) to Debian Stretch - https://phabricator.wikimedia.org/T192636#4235561 (elukey)
[08:52:09] joal: very interesting.. I noticed that the segment cache in /var/lib/druid on druid1003 was still only 89M, so I took a look at the historical logs and they were full of errors about not being able to find the host for analytics-hadoop etc..
[08:52:18] that was really strange since the hadoop config is deployed
[08:52:41] so I restarted the historical and after a minute it was adding 6G to /var/lib/druid :D
[08:53:05] so I believe that puppet needs to be updated for the first run, the hadoop config needs to be there before druid
[08:53:37] otherwise if the historicals start and the hadoop-client doesn't find the hadoop config, they won't be able to access deep storage
[08:53:42] (at least, this is my understanding)
[09:07:56] all right, now we are at ~50G of segment cache, goood
[09:08:08] it will take a bit, it should reach much more than that
[09:10:00] I also restarted overlord/middlemanager, they would have been impacted by the same issue during the first indexation
[09:12:17] from puppet though it should have been handled correctly
[10:35:12] going afk for ~2h, ttl!
[10:50:55] (PS2) Joal: Correct sqoop scripts problems [analytics/refinery] - https://gerrit.wikimedia.org/r/435169
[12:07:30] elukey: For when you're back - I have the feeling something's not good with druid1003
[12:07:33] :(
[12:26:15] Analytics-Tech-community-metrics: Include gerrit DB's "author_bot" field also in the gerrit_demo DB - https://phabricator.wikimedia.org/T184907#4236241 (Aklapper)
[12:26:19] Analytics-Tech-community-metrics, Developer-Relations: Have "Last Attracted Developers" information for Gerrit automatically updated / Integrate new demography panels in GrimoireLab product - https://phabricator.wikimedia.org/T151161#4236240 (Aklapper)
[12:27:30] Analytics-Tech-community-metrics, Developer-Relations, Epic: Visualization/data regressions after moving from korma.wmflabs.org to wikimedia.biterg.io - https://phabricator.wikimedia.org/T137997#4236243 (Aklapper)
[12:27:33] Analytics-Tech-community-metrics, Regression: Exclude upstream repositories in the default view on wikimedia.biterg.io (by setting up "Projects" once Bestiary is available?) - https://phabricator.wikimedia.org/T146135#4236242 (Aklapper)
[12:45:11] joal: I am back!
[12:45:13] what's wrong?
[12:46:12] Hi elukey! I've seen back-and-forth loading on druid1003 from the coord-ui :(
[12:48:26] joal: so atm the segment cache is ~344G, I think it still needs a bit more time before reaching the other ones (~500G)
[12:48:42] unless you've seen failures here and there
[12:48:56] elukey: no failures, just not-fully-loaded
[12:49:06] I was wondering if it meant failures or not
[12:49:27] looks like it doesn't - so all good :)
[12:49:33] but I preferred to double check
[12:50:26] nono please always double check :) I am not 100% sure, but this morning I didn't take any precaution (on purpose) to save the /var/lib/druid segment cache
[12:51:03] because in my mind the coordinator should realize that a node is empty and force it to populate segments
[12:51:53] in /var/lib/druid on druid1001 I can see
[12:52:04] /var/lib/druid/clickhouse-otto-test -> ~500G :P
[12:52:14] /var/lib/druid/segment-cache -> ~470G
[12:52:43] so in theory ~130G still missing from druid1003 before it stops loading segments
[12:52:46] elukey: maybe we should drop clickhouse?
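For reference, the "~130G missing" figure in the exchange above is just the gap between druid1001's segment cache and druid1003's at the time. A minimal sketch of that arithmetic, using the rounded values quoted in the log:

```python
# Values quoted in the log (rounded, in GB).
target_gb = 470   # segment-cache size on a fully loaded node (druid1001)
current_gb = 344  # segment-cache size on druid1003 at the time

remaining_gb = target_gb - current_gb
print(f"~{remaining_gb}G still to load on druid1003")
```

This matches the "~130G" estimate once rounding is accounted for.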
[12:52:58] elukey: I still have it in the back of my mind though
[12:53:14] the plan that I have is to wipe all the hosts, so we'll start from a clean status on each
[12:53:27] I already had a chat with Andrew and that dataset should not be relevant
[12:55:13] !log re-run webrequest-load-wf-misc-2018-5-28-10
[12:55:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:58:11] joal: if what I wrote still leaves you doubtful, please tell me and I'll try to dig more into it :)
[12:58:58] elukey: you know me - if I doubt, I start crying on your shoulder ;)
[13:05:00] (CR) Sahil505: "Hey, so I tried semantic multiple times to achieve the mock design of the footer but it is not working as good as the custom CSS that I wr" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/434971 (https://phabricator.wikimedia.org/T191672) (owner: Sahil505)
[13:17:57] (PS3) Sahil505: Upgraded footer UI/design [analytics/wikistats2] - https://gerrit.wikimedia.org/r/434971 (https://phabricator.wikimedia.org/T191672)
[13:58:38] (CR) Mforns: [V: 2 C: 2] "Looks great :]" (2 comments) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/434971 (https://phabricator.wikimedia.org/T191672) (owner: Sahil505)
[14:04:10] (Merged) jenkins-bot: Upgraded footer UI/design [analytics/wikistats2] - https://gerrit.wikimedia.org/r/434971 (https://phabricator.wikimedia.org/T191672)
[14:17:33] Analytics: Generate pagecounts-ez data back to 2008 - https://phabricator.wikimedia.org/T188041#4236585 (CristianCantoro) Ok, I have written a script that uses [[ https://www.gnu.org/software/parallel/ | GNU Parallel ]] to process multiple days at the same time. Using 6 cores I was able to process 23 days wo...
[14:28:41] (PS1) Mforns: Prepare for release 2.2.6 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/435791
[14:29:13] (CR) Mforns: [V: 2 C: 2] "DEPLOYING" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/435791 (owner: Mforns)
[14:30:43] Analytics-Legal, WMF-Legal, Wikidata: Solve legal uncertainty of Wikidata - https://phabricator.wikimedia.org/T193728#4236620 (ArthurPSmith) Hi - my most recent response was following MisterSynergy's comment on Denny's proposed questions, and specifically the meaning of "processes that in bulk extrac...
[14:34:12] (PS1) Mforns: Release 2.2.6 [analytics/wikistats2] (release) - https://gerrit.wikimedia.org/r/435792
[14:46:03] (PS3) Joal: Correct sqoop scripts problems [analytics/refinery] - https://gerrit.wikimedia.org/r/435169
[15:02:02] ah! got kicked from standup!
[15:02:20] well, you can sit mforns ;)
[15:02:36] xD
[15:28:43] what's up with her voice?
[15:31:31] ok, I survived :D
[15:31:46] mforns: https://en.wikipedia.org/wiki/Vocoder
[15:32:13] aha..
[15:59:01] joal: going to swap conf1002 with conf1005 in a bit
[15:59:03] (zookeeper)
[15:59:33] elukey: here to help if needed - please let me know if you want me to have a dedicated look at something
[15:59:54] should be ok, a lot of restarts but nothing more (hopefully)
[16:00:09] k
[16:00:21] elukey: here nonetheless, please don't hesitate to ask
[16:12:53] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on einsteinium is CRITICAL: 4.637e+06 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[16:13:33] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on einsteinium is CRITICAL: 1.376e+06 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[16:13:34] PROBLEM - Kafka MirrorMaker main-eqiad_to_eqiad max lag in last 10 minutes on einsteinium is CRITICAL: 1.376e+06 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_eqiad
[16:13:42] lovely
[16:15:22] ah yes, this is burrow that got restarted
[16:15:24] uffff
[16:17:53] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on einsteinium is OK: (C)1e+05 gt (W)1e+04 gt 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[16:19:03] RECOVERY - Kafka MirrorMaker main-eqiad_to_eqiad max lag in last 10 minutes on einsteinium is OK: (C)1e+05 gt (W)1e+04 gt 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_eqiad
[16:19:32] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on einsteinium is OK: (C)1e+05 gt (W)1e+04 gt 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[16:21:18] (PS3) Joal: Update mediawiki-history stats [analytics/refinery/source] - https://gerrit.wikimedia.org/r/434987 (https://phabricator.wikimedia.org/T192481)
[16:27:43] joal: new zk cluster running on conf1003,1004,1005
[16:27:53] going to start the kafka/hadoop roll restart in a bit
[17:23:05] kafka100[1-3] (Job queues) completed
[17:43:40] atm I am restarting Kafka analytics and jumbo
[18:09:22] PROBLEM - Kafka MirrorMaker main-eqiad_to_eqiad average message consume rate in last 30m on einsteinium is CRITICAL: 0 le 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_eqiad
[18:09:32] PROBLEM - Kafka MirrorMaker main-eqiad_to_eqiad average message produce rate in last 30m on einsteinium is CRITICAL: 0 le 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_eqiad
[18:13:18] checking --^
[18:14:46] fdans, you still around?
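The max-lag alerts quoted above read as simple threshold comparisons: CRITICAL when the observed lag exceeds 1e+05 (e.g. `4.637e+06 gt 1e+05`). The real Icinga/Burrow check is more involved; this is only a sketch of the comparison the alert text expresses, with the threshold taken from the log:

```python
# Threshold from the alert text: CRITICAL when max lag over the window
# exceeds 1e+05 messages. This mirrors only the comparison shown in the log.
CRITICAL_THRESHOLD = 1e5

def lag_status(max_lag: float) -> str:
    """Classify a MirrorMaker max-lag sample against the critical threshold."""
    return "CRITICAL" if max_lag > CRITICAL_THRESHOLD else "OK"

print(lag_status(4.637e6))  # the 16:12:53 sample
print(lag_status(0))        # a fully caught-up mirror
```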
[18:14:53] RECOVERY - Kafka MirrorMaker main-eqiad_to_eqiad average message produce rate in last 30m on einsteinium is OK: (C)0 le (W)100 le 471.9 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_eqiad
[18:15:52] RECOVERY - Kafka MirrorMaker main-eqiad_to_eqiad average message consume rate in last 30m on einsteinium is OK: (C)0 le (W)100 le 484.6 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_eqiad
[18:16:25] !log restart kafka mirror maker on kafka1012->14 - failed after the last round of kafka restarts
[18:16:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:22:25] ok, all the kafka clusters are done
[18:22:39] since it is a bit late I'll do the hadoop masters tomorrow
[18:22:42] Cc: joal
[18:22:51] thanks elukey :)
[18:23:16] zookeeper on conf1002 is down (and masked) so it will not pop up again
[18:23:42] will check later on to make sure that nothing explodes :)
[18:24:18] elukey: Have a good evening elukey :)
[18:41:31] mforns: I was doing laundry! you still need me?
[18:41:36] hey fdans
[18:41:38] :]
[18:42:05] yea, I have a problem, puppet does not deploy what I pushed to the wikistats release...
[18:42:16] * elukey afk!
[18:42:23] bye elukey :]
[18:43:58] mforns: looking
[18:44:10] fdans, wanna cave?
[18:44:16] yep!
[18:44:19] omw
[18:46:22] (CR) Mforns: [C: 2] Release 2.2.6 [analytics/wikistats2] (release) - https://gerrit.wikimedia.org/r/435792 (owner: Mforns)
[18:51:22] !log rerun webrequest-load-wf-upload-2018-5-28-14
[18:51:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:54:48] Analytics, Research: Provide data dumps in the Analytics Data Lake - https://phabricator.wikimedia.org/T186559#4237217 (Neil_P._Quinn_WMF)
[21:05:57] (CR) Mforns: [C: 1] "LGTM!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/434987 (https://phabricator.wikimedia.org/T192481) (owner: Joal)