[08:34:35] !log Restart misc load job with 10% data loss error threshold
[09:32:33] angry oozie is angry
[12:40:40] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:42:58] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[13:07:32] Analytics: Remove outdated docs regarding dashboard info - https://phabricator.wikimedia.org/T137883#2382422 (Nuria)
[13:07:46] WOA Alarms!! \o/
[13:07:54] Analytics: Remove outdated docs regarding dashboard info - https://phabricator.wikimedia.org/T137883#2382436 (Nuria)
[13:11:51] Analytics, Analytics-Kanban, Patch-For-Review: analytics specific icinga alerts should ping in our IRC channel. - https://phabricator.wikimedia.org/T125128#2382438 (elukey)
[13:12:51] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Kafka 0.9's partitions rebalance causes data log mtime reset messing up with time based log retention - https://phabricator.wikimedia.org/T136690#2344329 (elukey) a:elukey
[13:34:44] mobrovac: o/ time to chat about event bus restart
[13:34:45] ?
[13:35:02] elukey: let's go!
[13:35:08] to chat, not to restart
[13:35:08] :P
[13:37:42] ahahaah
[13:38:04] sooo how do we want to do it?
[13:38:13] stop change prop just in case ?
[13:38:27] and then reboot one at the time following a depool/pool procedure?
[13:38:39] yeah, i think it's the safest plan
[13:39:02] it shouldn't last longer than 30 mins, and we've had CP stopped for much longer periods of time than that
[13:39:59] I believe that everything should be done in ~15 mins
[13:40:02] more or less
[13:42:52] mobrovac: let me know when you have time and I'll start the procedure (waiting for your green light for changeprop)
[13:43:36] elukey: let's do it in an hour or so?
[13:43:52] i've just deployed a change for changeprop and want to see it working :))))
[13:44:39] sure, ping me when you are ready :)
[14:57:18] elukey: ping?
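Editor's note: the plan agreed at 13:38 (reboot the hosts one at a time, following a depool/pool procedure) could be sketched roughly as below. This is only an illustration under stated assumptions, not the tooling used in the channel: `rolling_reboot` and the four callables passed to it are hypothetical placeholders, and stopping change propagation beforehand is left outside the sketch.

```python
from typing import Callable, Iterable, List


def rolling_reboot(
    hosts: Iterable[str],
    depool: Callable[[str], None],
    reboot: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    pool: Callable[[str], None],
) -> List[str]:
    """Reboot hosts strictly one at a time.

    Each host is depooled, rebooted, checked, and only then re-pooled.
    The loop stops at the first host that fails its check, leaving that
    host depooled so it receives no traffic.
    """
    done: List[str] = []
    for host in hosts:
        depool(host)
        reboot(host)
        if not is_healthy(host):
            raise RuntimeError(f"{host} unhealthy after reboot; left depooled")
        pool(host)
        done.append(host)
    return done


# Dry run with stand-in actions that only record what would happen.
log = []
result = rolling_reboot(
    ["kafka1001", "kafka1002"],
    depool=lambda h: log.append(("depool", h)),
    reboot=lambda h: log.append(("reboot", h)),
    is_healthy=lambda h: True,  # hypothetical check, always passes here
    pool=lambda h: log.append(("pool", h)),
)
print(result)
```

In a real run, each callable would wrap whatever depool, reboot, and health-check commands the operators actually use; the point of the sketch is only the ordering and the stop-on-failure behaviour.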
[14:58:05] pong
[14:58:15] :)
[14:58:23] let's go?
[14:58:25] CP is stopped
[14:58:28] sure!
[14:59:52] cool
[15:04:11] mobrovac: kafka1001 depooled and kafka stopped, rebooting
[15:04:19] kk
[15:08:51] mobrovac: kafka1001 up and running, do you want to test it?
[15:09:29] curl http://kafka1001.eqiad.wmnet:8085/v1/topics works
[15:09:45] kk
[15:09:51] and kafka itself is up too?
[15:09:55] yep
[15:10:02] k, let's proceed then
[15:10:14] ok re-pooled
[15:10:25] I am going to wait a couple of minutes and proceed with 1002
[15:13:09] smart :)
[15:15:15] all good, proceeding with 1002
[15:18:14] rebooting kafka1002
[15:23:40] mobrovac: all done!
[15:23:54] yay
[15:24:03] thnx elukey!
[15:28:01] mobrovac: thank you! I've also added a warning in https://wikitech.wikimedia.org/wiki/EventBus/Administration#Host_Reboot_or_Service_restart
[15:28:08] to contact services before proceeding
[15:28:30] very good!
[15:28:31] thnx
[15:30:23] also https://wikitech.wikimedia.org/wiki/Service_restarts#Kafka_brokers
[15:30:34] moritzm: ---^
[15:30:55] TL;DR: kafka100[12] have 4.4 now and I updated the service restart docs
[16:09:14] (Abandoned) Aklapper: This is a test commit - do not merge [analytics/refinery/source] - https://gerrit.wikimedia.org/r/279557 (owner: Madhuvishy)
[16:15:20] (CR) Aklapper: "This patch has been sitting here for 13 months without a review. Is this still wanted? Can someone please decide (-1, +1, +2) and get this" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/212541 (owner: Joal)
[16:15:31] (CR) Aklapper: "This patch has been sitting here for 11 months without a review. Is this still wanted? Can someone please decide (-1, +1, +2) and get this" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/225485 (owner: Ottomata)
[16:27:25] Analytics, Discovery, Maps, RESTBase-Cassandra, and 2 others: Investigate and implement possible simplification of Cassandra Logstash filtering - https://phabricator.wikimedia.org/T130861#2382980 (Aklapper) Any plans to review / decide on the patch that has been sitting in Gerrit for six weeks no...
[17:48:58] Friends, I'm hoping to get an extra pair of eyes on T132500
[17:48:59] T132500: Clicktracking data not matching up with donation totals - https://phabricator.wikimedia.org/T132500
[17:49:36] From what I can tell, we're losing about half of webrequest logs somewhere between whatever is collected into stat1002/hive, and the fundraising kafkatee pipe.
[17:54:11] Tangent: I noticed that your wikibugs configuration will echo "^Analytics-.*" tags here, but not "Blocked-on-Analytics".
[21:22:44] Analytics, Fundraising-Backlog, Blocked-on-Analytics, Fundraising Sprint Licking Cookies, Patch-For-Review: Clicktracking data not matching up with donation totals - https://phabricator.wikimedia.org/T132500#2383980 (awight)
[21:23:21] quiet here... am I looking in the wrong channel? hrm
[21:28:41] * awight feels twinge of guilt at using 15hours of stat1002 CPU time
[21:29:34] Analytics, Fundraising-Backlog, Blocked-on-Analytics, Fundraising Sprint Licking Cookies, Patch-For-Review: Clicktracking data not matching up with donation totals - https://phabricator.wikimedia.org/T132500#2383986 (awight) @Jgreen Thanx! Correcting in the comment, and it did make a huge di...
[21:40:55] Analytics, Fundraising-Backlog, Blocked-on-Analytics, Fundraising Sprint Licking Cookies, Patch-For-Review: Clicktracking data not matching up with donation totals - https://phabricator.wikimedia.org/T132500#2383992 (Jgreen) Another thing I noticed, which surely isn't good, is that somewhere...
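Editor's note on the 17:54:11 tangent: a pattern anchored with `^` only matches tags that begin with the given text, which would explain why "Blocked-on-Analytics" is not picked up by "^Analytics-.*". A minimal sketch follows, assuming the pattern is applied with Python-style regex semantics (an assumption; the actual wikibugs configuration is not shown in this log). The `broadened` pattern is a hypothetical alternative, not a proposed change to the real config.

```python
import re

# Pattern quoted in the channel.
quoted = re.compile(r"^Analytics-.*")

# Hypothetical broader pattern: "Analytics" as a whole hyphen-delimited
# component anywhere in the tag name.
broadened = re.compile(r"(^|-)Analytics(-|$)")

tags = [
    "Analytics-Kanban",
    "Analytics-Cluster",
    "Blocked-on-Analytics",
    "Fundraising-Backlog",
]

matched_quoted = [t for t in tags if quoted.search(t)]
matched_broadened = [t for t in tags if broadened.search(t)]

print(matched_quoted)     # tags starting with "Analytics-"
print(matched_broadened)  # also includes "Blocked-on-Analytics"
```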
[23:18:13] Analytics, Fundraising-Backlog, Blocked-on-Analytics, Fundraising Sprint Licking Cookies, Patch-For-Review: Clicktracking data not matching up with donation totals - https://phabricator.wikimedia.org/T132500#2384212 (awight) @Jgreen I get a different result using "cut": ``` zcat landingpages...
[23:46:13] Analytics, Fundraising-Backlog, Blocked-on-Analytics, Fundraising Sprint Licking Cookies, Patch-For-Review: Clicktracking data not matching up with donation totals - https://phabricator.wikimedia.org/T132500#2384284 (awight) I'm not sure if this is an issue, but I see that our kafkatee puppet...