[05:50:17] !log re-run webrequest-load-wf-misc-2018-5-27-22, webrequest-load-wf-text-2018-5-28-2, webrequest-load-wf-upload-2018-5-28-3
[05:50:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:28:34] joal: o/
[06:28:48] so I'd like to start my Monday with reimaging druid1003 to Stretch :)
[06:34:03] for the moment I just stopped everything on the node, and set druid1002 in superset
[06:56:52] works for me elukey :)
[06:57:41] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Upgrade Druid nodes (1001->1006) to Debian Stretch - https://phabricator.wikimedia.org/T192636#4145484 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['druid1003.eqiad.wmnet']...
[06:58:05] joal: just started!
[07:12:06] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Upgrade Druid nodes (1001->1006) to Debian Stretch - https://phabricator.wikimedia.org/T192636#4235479 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['druid1003.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['druid1003.eq...
[07:32:42] joal: druid1003 back in the cluster with stretch :)
[07:35:38] (CR) Elukey: [C: 1] "LGTM" [analytics/refinery] - https://gerrit.wikimedia.org/r/435169 (owner: Joal)
[07:39:45] the only weird thing is that zookeeper didn't pick up the jvm opts
[07:39:46] mmmm
[07:43:54] ah no, the daemon started before the new config was applied
[07:43:56] all good now
[07:46:31] I can see the coordinator leader (druid1001) telling druid1003 to load segments
[07:47:37] joal: if you are ok I'll do one druid node per day
[07:48:03] maybe even two, seems feasible
[07:48:19] so by the end of the week we'll have only stretch
[07:48:42] (afk for a bit!)
[08:02:05] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Upgrade Druid nodes (1001->1006) to Debian Stretch - https://phabricator.wikimedia.org/T192636#4235561 (elukey)
[08:52:09] joal: very interesting.. I noticed that the segment cache in /var/lib/druid on druid1003 was still only 89M, so I took a look at the historical logs and they were full of errors about not being able to find the host for analytics-hadoop etc..
[08:52:18] that was really strange since the hadoop config is deployed
[08:52:41] so I restarted the historical and after a minute it was adding 6G to /var/lib/druid :D
[08:53:05] so I believe that puppet needs to be updated for the first run, the hadoop config needs to be there before druid
[08:53:37] otherwise if the historicals start and the hadoop-client doesn't find the hadoop config, they won't be able to access deep storage
[08:53:42] (at least, this is my understanding)
[09:07:56] all right, now we are at ~50G of segment cache, goood
[09:08:08] it will take a bit, it should reach much more than that
[09:10:00] I also restarted overlord/middlemanager, they would have been impacted by the same issue during the first indexation
[09:12:17] from puppet though it should have been handled correctly
[10:35:12] going afk for ~2h, ttl!
[10:50:55] (PS2) Joal: Correct sqoop scripts problems [analytics/refinery] - https://gerrit.wikimedia.org/r/435169
[12:07:30] elukey: For when you're back - I have the feeling something's not good with druid1003
[12:07:33] :(
[12:26:15] Analytics-Tech-community-metrics: Include gerrit DB's "author_bot" field also in the gerrit_demo DB - https://phabricator.wikimedia.org/T184907#4236241 (Aklapper)
[12:26:19] Analytics-Tech-community-metrics, Developer-Relations: Have "Last Attracted Developers" information for Gerrit automatically updated / Integrate new demography panels in GrimoireLab product - https://phabricator.wikimedia.org/T151161#4236240 (Aklapper)
[12:27:30] Analytics-Tech-community-metrics, Developer-Relations, Epic: Visualization/data regressions after moving from korma.wmflabs.org to wikimedia.biterg.io - https://phabricator.wikimedia.org/T137997#4236243 (Aklapper)
[12:27:33] Analytics-Tech-community-metrics, Regression: Exclude upstream repositories in the default view on wikimedia.biterg.io (by setting up "Projects" once Bestiary is available?) - https://phabricator.wikimedia.org/T146135#4236242 (Aklapper)
[12:45:11] joal: I am back!
[12:45:13] what's wrong?
[12:46:12] Hi elukey! I've seen back-and-forth loading on druid1003 from the coord-ui :(
[12:48:26] joal: so atm the segment cache is ~344G, I think it still needs a bit more time before reaching the other ones (~500G)
[12:48:42] unless you've seen failures here and there
[12:48:56] elukey: no failures, just not-fully-loaded
[12:49:06] I was wondering if it meant failures or not
[12:49:27] looks like it doesn't - so all good :)
[12:49:33] but I preferred to double check
[12:50:26] nono please always double check :) I am not 100% sure, but this morning I didn't take any precaution (on purpose) to save the /var/lib/druid segment cache
[12:51:03] because in my mind the coordinator should realize that a node is empty and force it to populate segments
[12:51:53] in /var/lib/druid on druid1001 I can see
[12:52:04] /var/lib/druid/clickhouse-otto-test -> ~500G :P
[12:52:14] /var/lib/druid/segment-cache -> ~470G
[12:52:43] so in theory ~130G still missing from druid1003 before it stops loading segments
[12:52:46] elukey: maybe we should drop clickhouse?
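For reference, the "~130G missing" figure in the exchange above is just the gap between druid1001's segment cache and druid1003's at the time. A minimal sketch of that arithmetic, using the rounded values quoted in the log:

```python
# Values quoted in the log (rounded, in GB).
target_gb = 470   # segment-cache size on a fully loaded node (druid1001)
current_gb = 344  # segment-cache size on druid1003 at the time

remaining_gb = target_gb - current_gb
print(f"~{remaining_gb}G still to load on druid1003")
```

This matches the "~130G" estimate once rounding is accounted for.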
[12:52:58] elukey: I still have it in the back of my mind though
[12:53:14] the plan that I have is to wipe all the hosts, so we'll start from a clean status on each
[12:53:27] I already had a chat with Andrew and that dataset should not be relevant
[12:55:13] !log re-run webrequest-load-wf-misc-2018-5-28-10
[12:55:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:58:11] joal: if what I wrote still leaves you doubtful, please tell me and I'll try to dig more into it :)
[12:58:58] elukey: you know me - if I doubt, I start crying on your shoulder ;)
[13:05:00] (CR) Sahil505: "Hey, so I tried semantic multiple times to achieve the mock design of the footer but it is not working as good as the custom CSS that I wr" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/434971 (https://phabricator.wikimedia.org/T191672) (owner: Sahil505)
[13:17:57] (PS3) Sahil505: Upgraded footer UI/design [analytics/wikistats2] - https://gerrit.wikimedia.org/r/434971 (https://phabricator.wikimedia.org/T191672)
[13:58:38] (CR) Mforns: [V: 2 C: 2] "Looks great :]" (2 comments) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/434971 (https://phabricator.wikimedia.org/T191672) (owner: Sahil505)
[14:04:10] (Merged) jenkins-bot: Upgraded footer UI/design [analytics/wikistats2] - https://gerrit.wikimedia.org/r/434971 (https://phabricator.wikimedia.org/T191672)
[14:17:33] Analytics: Generate pagecounts-ez data back to 2008 - https://phabricator.wikimedia.org/T188041#4236585 (CristianCantoro) Ok, I have written a script that uses [[ https://www.gnu.org/software/parallel/ | GNU Parallel ]] to process multiple days at the same time. Using 6 cores I was able to process 23 days wo...
[14:28:41] (PS1) Mforns: Prepare for release 2.2.6 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/435791
[14:29:13] (CR) Mforns: [V: 2 C: 2] "DEPLOYING" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/435791 (owner: Mforns)
[14:30:43] Analytics-Legal, WMF-Legal, Wikidata: Solve legal uncertainty of Wikidata - https://phabricator.wikimedia.org/T193728#4236620 (ArthurPSmith) Hi - my most recent response was following MisterSynergy's comment on Denny's proposed questions, and specifically the meaning of "processes that in bulk extrac...
[14:34:12] (PS1) Mforns: Release 2.2.6 [analytics/wikistats2] (release) - https://gerrit.wikimedia.org/r/435792
[14:46:03] (PS3) Joal: Correct sqoop scripts problems [analytics/refinery] - https://gerrit.wikimedia.org/r/435169
[15:02:02] ah! got kicked from standup!
[15:02:20] well, you can sit mforns ;)
[15:02:36] xD
[15:28:43] what's up with her voice?
[15:31:31] ok, I survived :D
[15:31:46] mforns: https://en.wikipedia.org/wiki/Vocoder
[15:32:13] aha..
[15:59:01] joal: going to swap conf1002 with conf1005 in a bit
[15:59:03] (zookeeper)
[15:59:33] elukey: here to help if needed - please let me know if you want me to have a dedicated look at something
[15:59:54] should be ok, a lot of restarts but nothing more (hopefully)
[16:00:09] k
[16:00:21] elukey: here nonetheless, please don't hesitate to ask
[16:12:53] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on einsteinium is CRITICAL: 4.637e+06 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[16:13:33] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on einsteinium is CRITICAL: 1.376e+06 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[16:13:34] PROBLEM - Kafka MirrorMaker main-eqiad_to_eqiad max lag in last 10 minutes on einsteinium is CRITICAL: 1.376e+06 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_eqiad
[16:13:42] lovely
[16:15:22] ah yes, this is burrow that got restarted
[16:15:24] uffff
[16:17:53] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on einsteinium is OK: (C)1e+05 gt (W)1e+04 gt 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[16:19:03] RECOVERY - Kafka MirrorMaker main-eqiad_to_eqiad max lag in last 10 minutes on einsteinium is OK: (C)1e+05 gt (W)1e+04 gt 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_eqiad
[16:19:32] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on einsteinium is OK: (C)1e+05 gt (W)1e+04 gt 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[16:21:18] (PS3) Joal: Update mediawiki-history stats [analytics/refinery/source] - https://gerrit.wikimedia.org/r/434987 (https://phabricator.wikimedia.org/T192481)
[16:27:43] joal: new zk cluster running on conf1003,1004,1005
[16:27:53] going to start the kafka/hadoop roll restart in a bit
[17:23:05] kafka100[1-3] (Job queues) completed
[17:43:40] atm I am restarting Kafka analytics and jumbo
[18:09:22] PROBLEM - Kafka MirrorMaker main-eqiad_to_eqiad average message consume rate in last 30m on einsteinium is CRITICAL: 0 le 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_eqiad
[18:09:32] PROBLEM - Kafka MirrorMaker main-eqiad_to_eqiad average message produce rate in last 30m on einsteinium is CRITICAL: 0 le 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_eqiad
[18:13:18] checking --^
[18:14:46] fdans, you still around?
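The max-lag alerts quoted above read as simple threshold comparisons: CRITICAL when the observed lag exceeds 1e+05 (e.g. `4.637e+06 gt 1e+05`). The real Icinga/Burrow check is more involved; this is only a sketch of the comparison the alert text expresses, with the threshold taken from the log:

```python
# Threshold from the alert text: CRITICAL when max lag over the window
# exceeds 1e+05 messages. This mirrors only the comparison shown in the log.
CRITICAL_THRESHOLD = 1e5

def lag_status(max_lag: float) -> str:
    """Classify a MirrorMaker max-lag sample against the critical threshold."""
    return "CRITICAL" if max_lag > CRITICAL_THRESHOLD else "OK"

print(lag_status(4.637e6))  # the 16:12:53 sample
print(lag_status(0))        # a fully caught-up mirror
```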
[18:14:53] RECOVERY - Kafka MirrorMaker main-eqiad_to_eqiad average message produce rate in last 30m on einsteinium is OK: (C)0 le (W)100 le 471.9 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_eqiad
[18:15:52] RECOVERY - Kafka MirrorMaker main-eqiad_to_eqiad average message consume rate in last 30m on einsteinium is OK: (C)0 le (W)100 le 484.6 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_eqiad
[18:16:25] !log restart kafka mirror maker on kafka1012->14 - failed after the last round of kafka restarts
[18:16:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:22:25] ok, all the kafka clusters are done
[18:22:39] since it is a bit late I'll do the hadoop masters tomorrow
[18:22:42] Cc: joal
[18:22:51] thanks elukey :)
[18:23:16] zookeeper on conf1002 is down (and masked) so it will not pop up again
[18:23:42] will check later on to make sure that nothing explodes :)
[18:24:18] elukey: Have a good evening elukey :)
[18:41:31] mforns: I was doing laundry! you still need me?
[18:41:36] hey fdans
[18:41:38] :]
[18:42:05] yea, I have a problem, puppet does not deploy what I pushed to the wikistats release...
[18:42:16] * elukey afk!
[18:42:23] bye elukey :]
[18:43:58] mforns: looking
[18:44:10] fdans, wanna cave?
[18:44:16] yep!
[18:44:19] omw
[18:46:22] (CR) Mforns: [C: 2] Release 2.2.6 [analytics/wikistats2] (release) - https://gerrit.wikimedia.org/r/435792 (owner: Mforns)
[18:51:22] !log rerun webrequest-load-wf-upload-2018-5-28-14
[18:51:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:54:48] Analytics, Research: Provide data dumps in the Analytics Data Lake - https://phabricator.wikimedia.org/T186559#4237217 (Neil_P._Quinn_WMF)
[21:05:57] (CR) Mforns: [C: 1] "LGTM!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/434987 (https://phabricator.wikimedia.org/T192481) (owner: Joal)