[00:09:40] (PS18) Sbisson: Oozie job for Wikipedia Preview stats [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/635578 (https://phabricator.wikimedia.org/T261953)
[00:09:49] (CR) Sbisson: Oozie job for Wikipedia Preview stats (3 comments) [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/635578 (https://phabricator.wikimedia.org/T261953) (owner: Sbisson)
[07:09:18] (CR) Elukey: Oozie job for Wikipedia Preview stats (1 comment) [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/635578 (https://phabricator.wikimedia.org/T261953) (owner: Sbisson)
[07:25:28] Good morning
[07:29:26] bonjour!
[07:30:16] so I think that oozie supports the multi-metastore uris only in bigtop's version
[07:30:57] elukey: please be gentle, coffee has not kicked in ;)
[07:31:11] ahhahaha sure sorry
[07:31:18] ;)
[07:31:19] I was already in code review mode
[07:31:36] Let me phrase my understanding
[07:31:59] I can add more words sorry
[07:32:05] so I can explain the whole picture
[07:32:13] (and it helps me to understand if it is right)
[07:32:42] for some reason, the trick to have the metastore in HA is to
[07:32:44] On our path to full HA, we have mysql (mostly done, with CNAMES and kerberos stuff)
[07:32:51] Analytics, Analytics-Kanban, Patch-For-Review: Set up automatic deletion/sanitization for netflow data set in Hive - https://phabricator.wikimedia.org/T231339 (ayounsi) >>! In T231339#6652009, @mforns wrote: > So, please, let us know if you guys have any periodic jobs that consume either of those 2 so...
[07:33:00] And then we need Hive-server2 and Metastore
[07:33:00] ah sorry
[07:33:16] Right we're at the same spot, please go
[07:34:04] the mysql part is mostly done, but there is still one caveat about the usage of "an-coord1001" in puppet etc.., since if I use analytics-hive I'll break the TLS certificate validation
[07:34:23] every mysql node exposes its puppet hostname cert for TLS
[07:34:40] so in theory I'd need to add a specific one for analytics-mysql.eqiad.wmnet, or similar
[07:35:07] Ohhhh - kerberos principal is used in TLS certs?
[07:35:20] nono for bare mysql stuff only TLS
[07:35:24] with regular user/pass
[07:35:37] (I mean say metastore db, superset db, etc..)
[07:35:45] they all have an-coord1001.eqiad.wmnet in their config
[07:35:56] so in case of failover, we'll need to replace and restart
[07:36:00] but it is not a big deal
[07:36:07] I mean, acceptable for the moment
[07:36:18] Ah - so in order to get full CNAME usage, not only kerb, we'd need CNAME TLS certs for tools to communicate to mysql
[07:36:36] very accepta
[07:36:39] yes exactly, but we can do it later on if needed
[07:36:48] then hive :)
[07:37:10] so analytics-hive is ok for the server2, and we know it, but the metastore's HA setup is weird
[07:37:11] +ble, just trying to raise to your level elukey - meanwhile drinking a lot of coffee :)
[07:37:25] ahahah no sorry for the brutal start of the friday, I can shut up
[07:37:40] all good :)
[07:38:23] elukey: can we please take a minute to review hive-server2 HA stuff (we'll move to metastore just after)
[07:39:43] So, hive-server2 is behind CNAME and accepts kerberos-CNAME principal - So if an-coord1001 fails, it's a cname move to an-coord1002 and it all should work - correct?
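
(Aside: a minimal Go sketch of the TLS caveat discussed above, illustrative only and not from the log. Hostname verification only accepts names covered by the certificate's SANs, so a cert issued for an-coord1001.eqiad.wmnet will not validate for clients that connect via a CNAME such as analytics-mysql.eqiad.wmnet unless that name is added to the cert. The certificate path below is a hypothetical placeholder.)

    // Check which hostnames a server certificate actually covers.
    package main

    import (
        "crypto/x509"
        "encoding/pem"
        "fmt"
        "os"
    )

    func main() {
        // Hypothetical path to the cert that an-coord1001 presents for MySQL TLS.
        pemBytes, err := os.ReadFile("/etc/ssl/localcerts/an-coord1001.eqiad.wmnet.pem")
        if err != nil {
            panic(err)
        }
        block, _ := pem.Decode(pemBytes)
        if block == nil {
            panic("no PEM block found")
        }
        cert, err := x509.ParseCertificate(block.Bytes)
        if err != nil {
            panic(err)
        }
        for _, name := range []string{"an-coord1001.eqiad.wmnet", "analytics-mysql.eqiad.wmnet"} {
            // VerifyHostname returns an error unless the name is covered by the
            // certificate's SANs, which is why pointing clients at a CNAME alone
            // breaks validation until the cert also lists that name.
            fmt.Printf("%s: %v\n", name, cert.VerifyHostname(name))
        }
    }
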
[07:40:45] yes so at the final stage, yes
[07:41:01] currently analytics-hive (the kerb principal) and the CNAME point to an-coord1002
[07:41:14] and an-coord1001 is still running the "old" scheme
[07:41:14] Ah right
[07:41:21] yeah yeah
[07:41:22] and they both use the same metastore
[07:41:25] ack ack
[07:42:49] but yes at the end of the journey both an-coords will run the same creds, analytics-hive, and simply flipping the CNAME will change traffic
[07:43:00] Analytics, Analytics-Kanban, Patch-For-Review: Set up automatic deletion/sanitization for netflow data set in Hive - https://phabricator.wikimedia.org/T231339 (JAllemandou) > Here's a sketch of the migration plan, mostly a reference for myself! Please raise flags if something is missing. @mforns This...
[07:43:28] so it will be useful to do roll restarts without draining the cluster (if HA metastore works)
[07:43:36] ok elukey - So now to metastore - To get HA in metastore is more complicated than for hive-server2?
[07:44:38] a little different, so it needs
[07:44:54] 1) DBToken enabled, so all running metastores save tokens on the db
[07:45:10] And can therefore share sessions
[07:45:14] 2) in hive-site.xml, the thrift:// url can contain multiple hostnames
[07:45:17] yes correct
[07:45:42] ah - so no cname here, multi-url
[07:45:51] like
[07:45:51] thrift://metastore1.example.com,thrift://metastore2.example.com,thrift://metastore3.example.com
[07:46:20] now this gets interesting when tools like oozie need to be configured
[07:46:47] oozie/clickstream/coordinator.properties:hive_metastore_uri = thrift://an-coord1001.eqiad.wmnet:9083
[07:47:05] the "hive_metastore_uris" property, IIUC, is available only for oozie 4.3
[07:48:10] hm
[07:48:35] like there was no metastore uri provided before?
[07:48:53] Or it was accessed through hive-site maybe?
[07:49:14] nono the above is an example of what we use, but it is "uri" not "uris"
[07:49:17] https://issues.apache.org/jira/browse/OOZIE-2701
[07:49:23] Right
[07:49:44] ah snap wait it says 5.x
[07:49:48] * elukey cries in a corner
[07:50:30] hm - wouldn't using a CNAME strategy work as well?
[07:51:00] I thought that it was https://issues.apache.org/jira/browse/OOZIE-2431, will need to verify...
[07:51:17] the CNAME could work, but I am wondering if it messes up the db state or not
[07:51:37] I assume it could be possible elukey
[07:52:52] oozie itself can also be active/standby, using zookeeper, but it seems a little overkill
[07:53:18] and then there is the presto coordinator, that uses TLS and Kerberos
[07:53:32] hm - Maybe for those having the ability to easily restart them on an-coord1002 instead of 1 is ok?
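
(Aside: a sketch, not the actual production config, of what the two metastore HA pieces described above could look like in hive-site.xml, assuming the standard Hive properties for multi-URI clients and DB-backed delegation tokens. Hostnames and port are illustrative, taken from the conversation.)

    <!-- illustrative only -->
    <property>
      <name>hive.metastore.uris</name>
      <!-- comma-separated list of metastores; clients fall back to the next URI -->
      <value>thrift://an-coord1001.eqiad.wmnet:9083,thrift://an-coord1002.eqiad.wmnet:9083</value>
    </property>
    <property>
      <name>hive.cluster.delegation.token.store.class</name>
      <!-- store delegation tokens in the backing database so all metastores share them -->
      <value>org.apache.hadoop.hive.thrift.DBTokenStore</value>
    </property>

(The Oozie side would then also need the multi-URI variant of the coordinator property quoted above, "uris" rather than "uri", which is the Oozie version question around OOZIE-2701 discussed in the log.)
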
[07:54:42] in theory in case of a failover we could just add them via puppet to an-coord1002
[07:55:10] I think it is acceptable for the immediate term, I really hope that we'll think about airflow HA rather than oozie :D
[07:55:44] That would make me a lot happier indeed elukey :)
[07:58:39] :)
[08:05:08] !log roll restart druid public cluster for openjdk upgrades
[08:05:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:05:14] let's see if aqs complains
[08:24:29] so simply doing a roll restart doesn't blow up aqs
[08:24:38] I think it is when something goes down permanently
[08:24:41] like in a reboot
[08:35:59] (PS5) Joal: Update sqoop adding tables [analytics/refinery] - https://gerrit.wikimedia.org/r/643029 (https://phabricator.wikimedia.org/T266077)
[08:43:47] (PS6) Joal: Update sqoop adding tables [analytics/refinery] - https://gerrit.wikimedia.org/r/643029 (https://phabricator.wikimedia.org/T266077)
[08:55:46] roll restart of druid completed
[09:01:18] (PS2) Joal: Refactor oozie mediawiki-history-load job [analytics/refinery] - https://gerrit.wikimedia.org/r/643033
[09:01:47] Fun fact: I wasted several hours yesterday because of a one-character mistake I made in my tool :-S
[09:01:52] Also, morning
[09:01:58] heya klausman
[09:02:20] good morning :)
[09:02:47] Turns out, when you pass around a waitgroup (basically a mutex to make sure all workers have terminated), you should pass it around as a *reference,* not a value %-)
[09:03:45] That said, Go was a lot more helpful than C++ would have been: the runtime detected the resulting deadlock and told me exactly where it was happening. Alas, I didn't spot my mistake until several hours later
[09:04:33] it happens to everybody!
[09:05:35] Yes, and then when I told kormat over beers in the evening, I got laughed at :D
[09:05:39] yesterday me and Gabriele reviewed an ssh config 100 times to figure out why it wasn't working, and we both didn't realize it was the hostname that was wrong
[09:06:05] and I checked auth log on bastions and stat100x a lot of times
[09:06:20] Nice!
[09:06:30] but there was a clear "look Luca, this hostname is wrong, I cannot tell you otherwise, please stop"
[09:20:07] My favorite failure mode is debugging something on a remote machine and one of your terminals is ssh'd to the entirely wrong machine
[09:24:32] yes it gets worse over time, the more you look at it the worse it gets
[09:24:44] then you take a break, come back, and your brain restarts working
[09:35:36] (PS3) Joal: Refactor oozie mediawiki-history-load job [analytics/refinery] - https://gerrit.wikimedia.org/r/643033
[09:47:31] elukey: I'm very sorry - I'm putting a lot of pressure on the metastore now due to a badly configured test
[09:49:09] pressure relieved (I hope so :S)
[09:54:39] Oh my :( hive has not yet recovered :(
[09:56:14] nah seems ok
[10:07:43] elukey: got bad error from spark trying to use metastore :(
[10:09:11] (CR) Joal: [V: +2] "Fully tested again on cluster with all sqoop job types" [analytics/refinery] - https://gerrit.wikimedia.org/r/643029 (https://phabricator.wikimedia.org/T266077) (owner: Joal)
[10:10:46] :)
[10:12:49] elukey: could you please poke the hive-metastore? I think I broke it :(
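
(Aside: a minimal Go sketch of the waitgroup mistake klausman describes above, illustrative code rather than the actual tool. sync.WaitGroup has to be passed by pointer; passing it by value hands each worker its own copy, so Done() never reaches the caller's WaitGroup and Wait() blocks until the Go runtime reports the deadlock.)

    package main

    import (
        "fmt"
        "sync"
    )

    // worker takes *sync.WaitGroup so Done() decrements the caller's counter.
    // With `wg sync.WaitGroup` (passed by value) main would deadlock on Wait().
    func worker(id int, wg *sync.WaitGroup) {
        defer wg.Done()
        fmt.Println("worker", id, "done")
    }

    func main() {
        var wg sync.WaitGroup
        for i := 0; i < 3; i++ {
            wg.Add(1)
            go worker(i, &wg)
        }
        wg.Wait() // returns once every worker has called Done on the shared WaitGroup
    }
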
[10:13:48] PROBLEM - Check the last execution of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:14:12] right
[10:14:16] I think this is me :(
[10:14:36] PROBLEM - Check the last execution of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:17:26] PROBLEM - Check the last execution of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:17:36] PROBLEM - Check the last execution of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:21:33] joal: checking
[10:21:52] weird though this is only el to druid
[10:22:16] if the metastore was broken I'd have expected way more firewords
[10:22:20] *fireworks
[10:23:34] the process is up, I see some errors logged
[10:24:13] ah ok the above are read timeouts to the metastore
[10:24:15] elukey: webrequest jobs are stuck - fireworks start soon
[10:24:37] I'm sorry elukey :(
[10:25:45] ah I see some GC activity in https://grafana.wikimedia.org/d/000000379/hive?orgId=1
[10:27:03] !log restart hive server and metastore on an-coord1001 - openjdk upgrades + problem with high GC caused by a job
[10:27:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:27:27] see I needed to drain the jobs to restart hive, done :D
[10:27:48] !log restart oozie and presto-server on an-coord1001 for openjdk upgrades
[10:27:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:29:21] !log restart eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 (failed) to see if the hive metastore works
[10:29:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:29:38] yep looks fine
[10:30:13] elukey: I confirm my spark job has started
[10:30:22] Thanks a lot elukey
[10:30:26] joal: what happened? No blame, it is fine, just curious :)
[10:31:20] elukey: I tested my mediawiki-load job on empty tables, leading to many very-big repairs
[10:32:09] ah ok so that was the GC activity
[10:34:09] I am restarting the failed job
[10:34:11] *jobs
[10:35:12] RECOVERY - Check the last execution of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:36:02] RECOVERY - Check the last execution of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:38:52] RECOVERY - Check the last execution of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:39:02] RECOVERY - Check the last execution of eventlogging_to_druid_netflow_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:02:41] all good!
[11:04:23] Many thanks elukey <3
[11:08:27] elukey: question for you
[11:08:46] the problem with metastore was with repairing big tables all at once
[11:09:16] elukey: this is one of the downsides of the change I have made to the mediawiki-load job
[11:09:47] elukey: the size of the repairs should be a lot smaller (1 snapshot only, not 6), but there will be a lot of tables trying to do so at once
[11:13:33] elukey: while writing a long sentence saying I couldn't find a good idea, I think I got one :)
[11:13:36] Will try it
[11:13:49] * joal should write long sentences more often
[11:14:28] :)
[11:15:03] elukey: would you give me a minute of batcave?
[11:15:31] I'll try to explain the options we have, and we can decide
[11:16:44] joal: sure
[11:33:51] (PS19) Sbisson: Oozie job for Wikipedia Preview stats [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/635578 (https://phabricator.wikimedia.org/T261953)
[11:34:11] (CR) Sbisson: Oozie job for Wikipedia Preview stats (1 comment) [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/635578 (https://phabricator.wikimedia.org/T261953) (owner: Sbisson)
[11:42:08] * klausman out for groceries and lunch
[12:22:09] (PS4) Joal: Refactor oozie mediawiki-history-load job [analytics/refinery] - https://gerrit.wikimedia.org/r/643033 (https://phabricator.wikimedia.org/T266077)
[12:54:59] Taking a break
[14:38:53] Analytics, Analytics-Kanban: Deprecate the 'researchers' posix group - https://phabricator.wikimedia.org/T268801 (SBisson) I think I was added to `analytics-privatedata-users` recently to work on an Oozie job so I should be fine.
[14:51:02] !log roll restart zookeeper on druid* nodes for openjdk upgrades
[14:51:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:53:36] Analytics, Analytics-Kanban, Patch-For-Review: Deprecate the 'researchers' posix group - https://phabricator.wikimedia.org/T268801 (elukey) >>! In T268801#6652970, @SBisson wrote: > I think I was added to `analytics-privatedata-users` recently to work on an Oozie job so I should be fine. Definitely,...
[17:15:23] /away afk!
[17:15:26] uff
[17:15:31] afk people! have a good weekend :)
[17:15:42] Bye elukey - have a good weekend :)
[17:22:45] you too joal !
[17:22:56] Yes! o/
[17:59:45] ah missed luca's bye, byee!
[18:54:19] (CR) Mforns: Refactor oozie mediawiki-history-load job (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/643033 (https://phabricator.wikimedia.org/T266077) (owner: Joal)
[18:56:25] mforns: hola!
[18:56:29] mforns: yt?
[18:56:34] hello!
[18:57:08] I owe you a review of the blog post, will do that today, taking advantage of thanksgiving
[19:01:29] (CR) Mforns: [C: +1] "Changes make sense. LGTM!" (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/643033 (https://phabricator.wikimedia.org/T266077) (owner: Joal)
[19:08:02] mforns: someone suspended my account on phabricator, could you open a ticket to andre asking him to revive it?
[19:08:31] mforns: no need to review blogpost yet, cause i am rewriting 50% of it today. Will ping you when you can take another pass
[19:08:53] nuria: ok
[19:11:48] nuria: https://phabricator.wikimedia.org/T268895
[19:11:59] mforns: super thanks
[19:12:06] np!
[19:56:11] (PS1) Joal: Add tables to mediawiki-history-load [analytics/refinery] - https://gerrit.wikimedia.org/r/643985 (https://phabricator.wikimedia.org/T266077)
[20:00:10] Analytics: Fix purging pageview_actor data - https://phabricator.wikimedia.org/T268382 (JAllemandou) a: JAllemandou
[20:00:18] Analytics, Analytics-Kanban: Fix purging pageview_actor data - https://phabricator.wikimedia.org/T268382 (JAllemandou)
[20:01:41] (PS7) Joal: Update sqoop adding tables [analytics/refinery] - https://gerrit.wikimedia.org/r/643029 (https://phabricator.wikimedia.org/T266077)
[20:22:37] Gone for tonight - Have a good weekend folks