[00:09:40] <wikibugs>	 (03PS18) 10Sbisson: Oozie job for Wikipedia Preview stats [analytics/wmf-product/jobs] - 10https://gerrit.wikimedia.org/r/635578 (https://phabricator.wikimedia.org/T261953)
[00:09:49] <wikibugs>	 (03CR) 10Sbisson: Oozie job for Wikipedia Preview stats (033 comments) [analytics/wmf-product/jobs] - 10https://gerrit.wikimedia.org/r/635578 (https://phabricator.wikimedia.org/T261953) (owner: 10Sbisson)
[07:09:18] <wikibugs>	 (03CR) 10Elukey: Oozie job for Wikipedia Preview stats (031 comment) [analytics/wmf-product/jobs] - 10https://gerrit.wikimedia.org/r/635578 (https://phabricator.wikimedia.org/T261953) (owner: 10Sbisson)
[07:25:28] <joal>	 Good morning
[07:29:26] <elukey>	 bonjour!
[07:30:16] <elukey>	 so I think that oozie supports the multi-metastore uris only in bigtop's version
[07:30:57] <joal>	 elukey: please be gentle, coffee has not kicked in  ;)
[07:31:11] <elukey>	 ahhahaha sure sorry
[07:31:18] <joal>	 ;)
[07:31:19] <elukey>	 I was already in code review mode
[07:31:36] <joal>	 Let me phrase my understanding
[07:31:59] <elukey>	 I can add more words sorry
[07:32:05] <elukey>	 so I can explain the whole picture
[07:32:13] <elukey>	 (and it helps me to understand if it is right)
[07:32:42] <elukey>	 for some reason, the trick to have the metastore in HA is to
[07:32:44] <joal>	 On our path to full HA, we have mysql (mostly done, with CNAMES and kerberos stuff)
[07:32:51] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Set up automatic deletion/snitization for netflow data set in Hive - https://phabricator.wikimedia.org/T231339 (10ayounsi) >>! In T231339#6652009, @mforns wrote: > So, please, let us know if you guys have any periodic jobs that consume either of those 2 so...
[07:33:00] <joal>	 And the we need Hive-server2 and Metastore
[07:33:00] <elukey>	 ah sorry
[07:33:16] <joal>	 Right we're at the same spot, please go
[07:34:04] <elukey>	 the mysql part is mostly done, but there is still one caveat about the usage of "an-coord1001" in puppet etc.., since if I use analytics-hive I'll break the TLS certificate validation
[07:34:23] <elukey>	 every mysql node expose its puppet hostname cert for TLS
[07:34:40] <elukey>	 so in theory I'd need to add a specific one for analytics-mysql.eqiad.wmnet, or similar
[07:35:07] <joal>	 Ohhhh - kerberos principal is used in TLS certs?
[07:35:20] <elukey>	 nono for bare mysql stuff only TLS
[07:35:24] <elukey>	 with regular user/pass
[07:35:37] <elukey>	 (I mean say metastore db, superset db, etc..)
[07:35:45] <elukey>	 they all have in their config an-coord1001.eqiad.wmnet
[07:35:56] <elukey>	 so in case of failover, we'll need to replace and restart
[07:36:00] <elukey>	 but it is not a big deal
[07:36:07] <elukey>	 I mean, acceptable for the moment
[07:36:18] <joal>	 Ah - so in order to get full CNAME usage, not only kerb, we'd need CNAME TLS certs for tools to communicate to mysql
[07:36:36] <joal>	 very accepta
[07:36:39] <elukey>	 yes exactly, but we can do it later on if needed
[07:36:48] <elukey>	 then hive :)
[07:37:10] <elukey>	 so analytics-hive is ok for the server2, and we know it, but the metastore's HA set up is weird
[07:37:11] <joal>	 +ble, just trying to raise to your level elukey - meanwhile drinking a lot of coffee :)
[07:37:25] <elukey>	 ahahah no sorry for the brutal start of the friday, I can shut up
[07:37:40] <joal>	 all good :)
[07:38:23] <joal>	 elukey: can we please take a minute to review hive-server2 HA stuff (we'll move to metastore just after)
[07:39:43] <joal>	 So, hive-server2 is behind CNAME and accepts kerberos-CNAME principal - So if an-coord1001 fails, it's a cname move to an-coord1002 and it all should work - correct?
[07:40:45] <elukey>	 yes so at the final stage, yes
[07:41:01] <elukey>	 currently analytics-hive (the kerb principal) and the CNAME point to an-coord1002
[07:41:14] <elukey>	 and an-coord1001 is still running the "old" scheme
[07:41:14] <joal>	 Ah right
[07:41:21] <joal>	 yeah yeah
[07:41:22] <elukey>	 and they both use the same metastore
[07:41:25] <elukey>	 ack ack
[07:42:49] <elukey>	 but yes at the end of the journey both an-coords will run the same creds, analytics-hive, and simply flipping the CNAME will change traffic
[07:43:00] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Set up automatic deletion/snitization for netflow data set in Hive - https://phabricator.wikimedia.org/T231339 (10JAllemandou) > Here's a sketch of the migration plan, mostly a reference for myself! Please raise flags if something is missing. @mforns This...
[07:43:28] <elukey>	 so it will be useful to do roll restarts without draining the cluster (if HA metastore works)
[07:43:36] <joal>	 ok elukey - So now to metastore - To get HA in metastore is more complicated than for hive-server2?
[07:44:38] <elukey>	 a little different, so it needs
[07:44:54] <elukey>	 1) DBToken enabled, so all metastores running save tokens on the db 
[07:45:10] <joal>	 And can therefore share sessions
[07:45:14] <elukey>	 2) in hive-site.xml, the thrift:// url can contain multiple hostnmaes
[07:45:17] <elukey>	 yes correct
[07:45:42] <joal>	 ah - so no cname here, multi-url
[07:45:51] <elukey>	 like
[07:45:51] <elukey>	  <value>thrift://metastore1.example.com,thrift://metastore2.example.com,thrift://metastore3.example.com</value>
[07:46:20] <elukey>	 now this gets interesting when tools like oozie needs to be configured
[07:46:47] <elukey>	 oozie/clickstream/coordinator.properties:hive_metastore_uri                = thrift://an-coord1001.eqiad.wmnet:9083
[07:47:05] <elukey>	 the "hive_metastore_uris" property, IIUC, is available only for oozie 4.3
[07:48:10] <joal>	 hm
[07:48:35] <joal>	 like there was no metastore uri provided before?
[07:48:53] <joal>	 Or it was accessed through hive-site maybe?
[07:49:14] <elukey>	 nono the above is an example of what we use, but it is "uri" not "uris"
[07:49:17] <elukey>	 https://issues.apache.org/jira/browse/OOZIE-2701
[07:49:23] <joal>	 Right
[07:49:44] <elukey>	 ah snap wait it says 5.x
[07:49:48] * elukey cries in a corner
[07:50:30] <joal>	 hm - wouldn't using a CNAME strategy work as well?
[07:51:00] <elukey>	 I thought that it was https://issues.apache.org/jira/browse/OOZIE-2431, will need to verify...
[07:51:17] <elukey>	 the CNAME could work, but I am wondering if it messes up the db state or not
[07:51:37] <joal>	 I assume it could be possible elukey 
[07:52:52] <elukey>	 oozie itself can also be active/standby, using zookeeper, but it seems a little overkill
[07:53:18] <elukey>	 and then there is the presto coordinator, that uses TLS and Kerberos
[07:53:32] <joal>	 hm - Maybe for those having the ability to easily restart them on an-coord1002 instead of 1 is ok?
[07:54:42] <elukey>	 in theory in case of a failover we could just add them via puppet to an-coord1002
[07:55:10] <elukey>	 I think it is acceptable for the immediate term, I really hope that we'll think about airflow HA rather than oozie :D
[07:55:44] <joal>	 That would make me a lot happier indeed elukey :)
[07:58:39] <elukey>	 :)
[08:05:08] <elukey>	 !log roll restart druid public cluster for openjdk upgrades
[08:05:09] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:05:14] <elukey>	 let's see if aqs complains
[08:24:29] <elukey>	 so simply doing a roll restart doesn't blow up aqs
[08:24:38] <elukey>	 I think it is when something goes down permanently
[08:24:41] <elukey>	 like in a reboot
[08:35:59] <wikibugs>	 (03PS5) 10Joal: Update sqoop adding tables [analytics/refinery] - 10https://gerrit.wikimedia.org/r/643029 (https://phabricator.wikimedia.org/T266077)
[08:43:47] <wikibugs>	 (03PS6) 10Joal: Update sqoop adding tables [analytics/refinery] - 10https://gerrit.wikimedia.org/r/643029 (https://phabricator.wikimedia.org/T266077)
[08:55:46] <elukey>	 roll restart of druid completed
[09:01:18] <wikibugs>	 (03PS2) 10Joal: Refactor oozie mediawiki-history-load job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/643033
[09:01:47] <klausman>	 Fun fact: I wasted several hours yesterday because of a one-character mistake I made in my tool :-S
[09:01:52] <klausman>	 Also, morning
[09:01:58] <joal>	 heya klausman 
[09:02:20] <elukey>	 good morning :)
[09:02:47] <klausman>	 Turns out, when you pass around a waitgroup (basically a mutex to make sure all workers have terminated), you should pass it around as a *reference,* not a value %-)
[09:03:45] <klausman>	 That said, Go was a lot more helpful than C++ would have been: the runtime detected the resulting deadlock and told me exactly where it was happening. Alas, I didn't spot my mistake until several hours later
[09:04:33] <elukey>	 it happens to everybody!
[09:05:35] <klausman>	 Yes, and then when I told kormat over beers in the evening, I got laughed at :D
[09:05:39] <elukey>	 yesterday me and Gabriele reviewed 100 times an ssh config to figure out what it wasn't working and we both didn't realize it was the hostnamae that was wrong
[09:06:05] <elukey>	 and I checked auth log on bastions and stat100x a lot of times
[09:06:20] <klausman>	 Nice!
[09:06:30] <elukey>	 but there was a clear "look Luca, this hostname is wrong, I cannot tell you otherwise, please stop"
[09:20:07] <klausman>	 My favorite failure mode is debugging something on a remote machine and one of your terminals is ssh'd to the entirely wrong machine
[09:24:32] <elukey>	 yes it get worse over time, the more you look at it the worse it gets
[09:24:44] <elukey>	 then you take a break, come back, and your brain restart working 
[09:35:36] <wikibugs>	 (03PS3) 10Joal: Refactor oozie mediawiki-history-load job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/643033
[09:47:31] <joal>	 elukey: I'm very sorry - I'm putting a lot on pressure on the metastore now due to a badly configured test
[09:49:09] <joal>	 pressure releved (I hope so :S)
[09:54:39] <joal>	 Oh my :( hive has not yet recovered :(
[09:56:14] <elukey>	 nah seems ok
[10:07:43] <joal>	 elukey: got bad error from spark trying to use metastore:(
[10:09:11] <wikibugs>	 (03CR) 10Joal: [V: 03+2] "Fully tested again on cluster with all sqoop job types" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/643029 (https://phabricator.wikimedia.org/T266077) (owner: 10Joal)
[10:10:46] <elukey>	 :)
[10:12:49] <joal>	 elukey: could you please poke the hive-metastore? I think I broke it :(
[10:13:48] <icinga-wm>	 PROBLEM - Check the last execution of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:14:12] <joal>	 right
[10:14:16] <joal>	 I think this is me :(
[10:14:36] <icinga-wm>	 PROBLEM - Check the last execution of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:17:26] <icinga-wm>	 PROBLEM - Check the last execution of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:17:36] <icinga-wm>	 PROBLEM - Check the last execution of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:21:33] <elukey>	 joal: checking
[10:21:52] <elukey>	 weird though this is only el to druid
[10:22:16] <elukey>	 if the metastore was broken I'd have expected way more firewords
[10:22:20] <elukey>	 *fireworks
[10:23:34] <elukey>	 the process is up, I see some errors logged 
[10:24:13] <elukey>	 ah ok the above are read timeouts to the metastore
[10:24:15] <joal>	 elukey: webrequest jobs are stuck - fireworks starts soon
[10:24:37] <joal>	 I'm sorry elukey :(
[10:25:45] <elukey>	 ah I see some GC activity in https://grafana.wikimedia.org/d/000000379/hive?orgId=1
[10:27:03] <elukey>	 !log restart hive server and metastore on an-coord1001 - openjdk upgrades + problem with high GC caused by a job
[10:27:05] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:27:27] <elukey>	 see I needed to drain the jobs to restart hive, done :D
[10:27:48] <elukey>	 !log restart oozie and presto-server on an-coord1001 for openjdk upgrades
[10:27:49] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:29:21] <elukey>	 !log restart eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 (failed) to see if the hive metastore works
[10:29:22] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:29:38] <elukey>	 yep looks fine
[10:30:13] <joal>	 elukey: I confirm my spark job has started
[10:30:22] <joal>	 Thanks a lot elukey 
[10:30:26] <elukey>	 joal: what happened? No blame, it is fine, just curious :)
[10:31:20] <joal>	 elukey: I tested my mediawiki-load job on empty tables, leading to many very-big repairs 
[10:32:09] <elukey>	 ah ok so that was the GC activity
[10:34:09] <elukey>	 I am restarting the failed job
[10:34:11] <elukey>	 *jobs
[10:35:12] <icinga-wm>	 RECOVERY - Check the last execution of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:36:02] <icinga-wm>	 RECOVERY - Check the last execution of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:38:52] <icinga-wm>	 RECOVERY - Check the last execution of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:39:02] <icinga-wm>	 RECOVERY - Check the last execution of eventlogging_to_druid_netflow_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:02:41] <elukey>	 all good!
[11:04:23] <joal>	 Many thanks elukey <3
[11:08:27] <joal>	 elukey: question for you
[11:08:46] <joal>	 the problem with metastore wath with repairing big tables all at once
[11:09:16] <joal>	 elukey: this is one of the dowside of the change I have made to the mediawiki-load job
[11:09:47] <joal>	 elukey: the size of the repairs should be a lot smaller (1 snapshot only, not 6), but there will be alot of tables trying to do so at once
[11:13:33] <joal>	 elukey: while writing a long sentence saying I couldn't find a good idea, I think I got one :)
[11:13:36] <joal>	 Will try it
[11:13:49] * joal should write long sentences more often
[11:14:28] <elukey>	 :)
[11:15:03] <joal>	 elukey: would you give me a minute of batcave?
[11:15:31] <joal>	 I'll try to explain the options we have, and we can decide
[11:16:44] <elukey>	 joal: sure 
[11:33:51] <wikibugs>	 (03PS19) 10Sbisson: Oozie job for Wikipedia Preview stats [analytics/wmf-product/jobs] - 10https://gerrit.wikimedia.org/r/635578 (https://phabricator.wikimedia.org/T261953)
[11:34:11] <wikibugs>	 (03CR) 10Sbisson: Oozie job for Wikipedia Preview stats (031 comment) [analytics/wmf-product/jobs] - 10https://gerrit.wikimedia.org/r/635578 (https://phabricator.wikimedia.org/T261953) (owner: 10Sbisson)
[11:42:08] * klausman out for groceries and lunch
[12:22:09] <wikibugs>	 (03PS4) 10Joal: Refactor oozie mediawiki-history-load job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/643033 (https://phabricator.wikimedia.org/T266077)
[12:54:59] <joal>	 Taking a break
[14:38:53] <wikibugs>	 10Analytics, 10Analytics-Kanban: Deprecate the 'researchers' posix group - https://phabricator.wikimedia.org/T268801 (10SBisson) I think I was added to `analytics-privatedata-users` recently to work on an Oozie job so I should be fine.
[14:51:02] <elukey>	 !log roll restart zookeeper on druid* nodes for openjdk upgrades
[14:51:06] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:53:36] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Deprecate the 'researchers' posix group - https://phabricator.wikimedia.org/T268801 (10elukey) >>! In T268801#6652970, @SBisson wrote: > I think I was added to `analytics-privatedata-users` recently to work on an Oozie job so I should be fine.  Definitely,...
[17:15:23] <elukey>	  /away afk!
[17:15:26] <elukey>	 uff
[17:15:31] <elukey>	 afk people! have a good weekend :)
[17:15:42] <joal>	 Bye elukey - have a good weekend :)
[17:22:45] <elukey>	 you too joal !
[17:22:56] <joal>	 Yes! o/
[17:59:45] <mforns>	 ah missed luca's bye, byee!
[18:54:19] <wikibugs>	 (03CR) 10Mforns: Refactor oozie mediawiki-history-load job (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/643033 (https://phabricator.wikimedia.org/T266077) (owner: 10Joal)
[18:56:25] <nuria>	 mforns: hola!
[18:56:29] <nuria>	 mforns: yt?
[18:56:34] <mforns>	 hello!
[18:57:08] <mforns>	 I owe you a review of the blog post, will do that today, taking advantage of thanksgiving
[19:01:29] <wikibugs>	 (03CR) 10Mforns: [C: 03+1] "Changes make sense. LGTM!" (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/643033 (https://phabricator.wikimedia.org/T266077) (owner: 10Joal)
[19:08:02] <nuria>	 mforns: someone suspended my account on phabricator, could you open a ticket to andre asking him to revive it? 
[19:08:31] <nuria>	 mforns: no need to review blogpost yet, cause i am rewriting 50% of it today. Will ping you when you can take another pass
[19:08:53] <mforns>	 nuria: ok
[19:11:48] <mforns>	 nuria: https://phabricator.wikimedia.org/T268895
[19:11:59] <nuria>	 mforns: super thanks
[19:12:06] <mforns>	 np!
[19:56:11] <wikibugs>	 (03PS1) 10Joal: Add tables to mediawiki-history-load [analytics/refinery] - 10https://gerrit.wikimedia.org/r/643985 (https://phabricator.wikimedia.org/T266077)
[20:00:10] <wikibugs>	 10Analytics: Fix purging pageview_actor data - https://phabricator.wikimedia.org/T268382 (10JAllemandou) a:03JAllemandou
[20:00:18] <wikibugs>	 10Analytics, 10Analytics-Kanban: Fix purging pageview_actor data - https://phabricator.wikimedia.org/T268382 (10JAllemandou)
[20:01:41] <wikibugs>	 (03PS7) 10Joal: Update sqoop adding tables [analytics/refinery] - 10https://gerrit.wikimedia.org/r/643029 (https://phabricator.wikimedia.org/T266077)
[20:22:37] <joal>	 Gone for tonight - Have a good weekend folks