[02:08:50] PROBLEM - Check the last execution of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:19:32] RECOVERY - Check the last execution of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:13:15] 10Analytics, 10Gerrit, 10Patch-For-Review, 10Release-Engineering-Team (Development services): Unable to clone git repo from stat1008 - https://phabricator.wikimedia.org/T268290 (10elukey) We have the following config on all stat nodes: ` elukey@stat1008:~$ cat /etc/gitconfig # vim: set ts=4 sw=4 et: # Thi...
[07:29:01] Hi lexnasser - too late I am, but I have an answer to your question :)
[07:38:22] Good morning!
[07:40:55] bonjour!
[07:45:41] elukey: would you give me a minute to talk about the wikidata job? There is something I don't understand :)
[07:46:01] yes sure, what do you mean?
[07:46:59] The coordinator's actions missing SLA are numbered 1 and 2 - to me it means the oozie job has been restarted
[07:47:16] And I don't understand why :(
[07:47:43] ah the last ones! Yes it was me, there were two failed schedules at the end of the coord
[07:47:57] did it do something wrong? I thought they failed from waiting too long
[07:48:03] I was about to reply
[07:48:25] (I checked them via hue.w.o, hue next is unusable sigh)
[07:49:09] I manage my way in hue-next now, but not easily :S
[07:49:25] I already opened all the issues on gh, but no answer
[07:49:28] elukey: I think that rerunning the failed ones would have been enough - it's ok nonetheless :)
[07:49:32] I am really depressed
[07:49:57] joal: yep yep I did re-run only the failed runs via "Rerun" on hue
[07:50:16] * joal sends wikilove to elukey
[07:50:33] Ok I'm lost :)
[07:50:44] there is
[07:50:44] https://github.com/cloudera/hue/issues/1386
[07:50:47] Trying to revamp
[07:50:52] https://github.com/cloudera/hue/issues/1373
[07:51:05] https://github.com/cloudera/hue/issues/1273
[07:51:11] https://github.com/cloudera/hue/issues/1272
[07:51:18] for the last two dan created a patch
[07:51:28] that I applied to our hue version, not yet upstream
[07:51:53] so joal to recap what I did, that may be wrong (almost for sure)
[07:52:16] 1) I checked previous alerts and saw Marcel's script, and used it (with a little tweak) to create the eqiad SUCCESS flags
[07:52:35] 2) in https://hue.wikimedia.org/oozie/list_oozie_coordinator/0001961-201103154415936-oozie-oozi-C/ at the bottom I noticed two failed runs, among the 4 listed, so I've re-run them
[07:52:54] and this generated the two last alerts IIUC
[07:53:03] Right
[07:53:07] ok I get it now
[07:53:21] the thing I didn't get was that instances 1 and 2 were the ones that had failed
[07:53:45] ah ok is it wrong? I can kill them if not needed
[07:53:48] You did nothing wrong elukey :)
[07:54:21] There might be a missing step in moving the SUCCESS files from the folder where they've been generated to the prod folders
[07:55:07] * elukey nods
[07:55:28] And, about the failed jobs, I think it's worth investigating: I'm gonna check folders of parent-data and generated data, to reconstruct a current state of affairs :)
[07:55:34] elukey: shall we do that together?
[07:56:22] joal: the script to generate _SUCCESS flags is still running, should we wait for it to finish?
[07:56:27] or do you want me to kill it?
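For context on the "Rerun" step mentioned above: the same re-run of failed coordinator actions can be done with the Oozie CLI instead of Hue. A minimal sketch, assuming the standard Oozie client is available on the host; the server URL is a placeholder, while the coordinator ID and action numbers are the ones from the conversation:

    # Point the CLI at the Oozie server (hypothetical host/port)
    export OOZIE_URL=http://an-coord1001.eqiad.wmnet:11000/oozie

    # Inspect the coordinator and identify the failed actions
    oozie job -info 0001961-201103154415936-oozie-oozi-C

    # Re-run only the failed action numbers (1 and 2 in this case)
    oozie job -rerun 0001961-201103154415936-oozie-oozi-C -action 1-2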
[07:56:58] elukey: checking dates for which we need the flags
[07:59:42] elukey: from what I see there are success flags for every hour of every day since September - We should be good
[07:59:57] elukey: problem must come from something else (but WHAT?)
[08:01:39] HAH!
[08:01:41] I get it
[08:01:57] I checked for _SUCCESS flags - they are present - _REFINED files are missing!
[08:02:24] And our job is waiting for _REFINED files
[08:02:54] right yes
[08:03:34] And from what I read in your email elukey you generate _SUCCESS files :)
[08:03:37] ok then I am stopping my script
[08:03:48] yes yes I didn't even check, followed what Marcel did, it is right
[08:04:41] ok, let's generate missing _REFINED flags :)
[08:05:08] I can fix my script if it is ok and re-run it
[08:05:17] the list of _REFINED is the hue one basically
[08:05:28] (filtered, uniqued etc..)
[08:05:32] elukey: works for me
[08:05:46] elukey: We might miss some historical flags but I think it's ok
[08:06:12] joal: I am a bit confused why the last time was _SUCCESS and this time _REFINED
[08:06:24] I think last time it didn't work either
[08:06:29] hahaha okok
[08:06:37] But we have not triple checked :)
[08:06:42] ok started
[08:06:47] * joal facepalm
[08:07:03] Thanks :)
[08:07:55] in the meantime, I am merging https://gerrit.wikimedia.org/r/c/operations/homer/public/+/642268 to fix gerrit firewall rules for the great analytics firewall
[08:08:03] if anything weird pops up lemme know :)
[08:08:28] Ack!
[08:09:07] joal: also I'd need to restart the hadoop masters later on (openjdk upgrades + node rack changes) - ok for you?
[08:09:15] yessir!
[08:14:44] 10Analytics, 10Gerrit, 10Release-Engineering-Team (Development services): Unable to clone git repo from stat1008 - https://phabricator.wikimedia.org/T268290 (10elukey) @brennen can you retry and see if it is fixed? :)
[08:30:51] actually I am going to kick off a roll restart of kafka jumbo first
[08:34:24] (mirror maker first, then kafka)
[08:46:52] \o/ I got a simple map-reduce gobblin job working :)
[08:52:19] wooowwww niceee
[08:53:19] Output data is really not formatted in a way we'd like and there is no time-related partitioning yet - But I overcame the build and dependency problems :)
[09:23:36] 10Analytics, 10Gerrit, 10Release-Engineering-Team (Development services): Unable to clone git repo from stat1008 - https://phabricator.wikimedia.org/T268290 (10kostajh) >>! In T268290#6635934, @elukey wrote: > @brennen can you retry and see if it is fixed? :) Hmm, now I get: ` kharlan@stat1008:~$ git clone...
[09:24:29] 10Analytics, 10Gerrit, 10Release-Engineering-Team (Development services): Unable to clone git repo from stat1008 - https://phabricator.wikimedia.org/T268290 (10kostajh) Oops, sorry the HTTPS clone was when I was experimenting with that instead of the SSH clone. With ssh clone, I still get a timeout: ` khar...
[09:30:25] 10Analytics, 10Gerrit, 10Release-Engineering-Team (Development services): Unable to clone git repo from stat1008 - https://phabricator.wikimedia.org/T268290 (10elukey) @kostajh sorry wrong ping in Phab :) You are definitely right, I added the IPs related to gerrit1001 and gerrit2001, but apparently it is no...
[10:06:04] 10Analytics, 10Gerrit, 10Patch-For-Review, 10Release-Engineering-Team (Development services): Unable to clone git repo from stat1008 - https://phabricator.wikimedia.org/T268290 (10elukey) @kostajh should be better now, can you retry?
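The _REFINED backfill discussed above boils down to creating empty marker files on HDFS for each hourly partition that was refined but lacks the flag. A rough sketch of what such a script could look like; the base path and date range are hypothetical placeholders (the real list of partitions was derived from the failed runs shown in Hue):

    # Hypothetical dataset path; adjust to the real refined dataset location
    BASE=/wmf/data/event/some_dataset/year=2020/month=11
    for day in $(seq -w 1 30); do
      for hour in $(seq -w 0 23); do
        dir="$BASE/day=$day/hour=$hour"
        # Only touch the flag if the partition exists and the flag is missing
        if hdfs dfs -test -d "$dir" && ! hdfs dfs -test -e "$dir/_REFINED"; then
          hdfs dfs -touchz "$dir/_REFINED"
        fi
      done
    done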
[10:29:20] 10Analytics, 10Gerrit, 10Release-Engineering-Team (Development services): Unable to clone git repo from stat1008 - https://phabricator.wikimedia.org/T268290 (10kostajh) >>! In T268290#6636219, @elukey wrote: > @kostajh should be better now, can you retry? Yes! Now I just need to figure out the key issue. `...
[11:27:29] 10Analytics, 10Gerrit, 10Release-Engineering-Team (Development services): Unable to clone git repo from stat1008 - https://phabricator.wikimedia.org/T268290 (10kostajh) OK, the anonymous HTTPS clone now works: ` lang=bash git clone "https://gerrit.wikimedia.org/r/research/mwaddlink" && (cd "mwaddlink" && m...
[11:39:29] going afk for lunch!!
[13:58:36] 10Analytics, 10Gerrit, 10Release-Engineering-Team (Development services): Unable to clone git repo from stat1008 - https://phabricator.wikimedia.org/T268290 (10elukey) >>! In T268290#6636347, @kostajh wrote: > OK, the anonymous HTTPS clone now works: > > ` lang=bash > git clone "https://gerrit.wikimedia.or...
[14:27:34] hadoop masters restarted
[14:27:54] they also picked up the new rack config (for the workers that we'll move next week)
[14:52:03] \o/
[15:55:51] elukey: hey, for when you have a bit of time: https://gerrit.wikimedia.org/r/c/operations/puppet/+/642454/
[15:56:00] straightforward, noop for prod
[15:57:31] Amir1: thanks a lot! Would it be too much to ask to also add types in the code review? Otherwise I can do it later on :)
[15:57:51] you mean typehinting?
[15:58:06] types to the class parameters
[15:58:16] yeah, let me do it
[15:58:29] thanks :)
[16:04:59] razzi: good morning :) are you already creating the zookeeper vm?
[16:05:26] hi elukey, yes, I saw your comment on gerrit just now as well
[16:05:49] I think the name you suggested is good; how do you rename a ganeti vm?
[16:06:23] yeah this is the issue, we need to kill it, and if it has already created the DNS etc.. it might need a decom step
[16:06:47] but if you already decided the naming with Andrew I am fine, it is not super important
[16:07:04] ah ok, the hard way I see :)
[16:07:23] we have a wiki page that dcops manages with the naming for servers, so they know who to contact if they have questions etc..
[16:07:29] lemme paste it in here
[16:08:28] https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions#Servers
[16:11:09] in theory when the VM creation is done, we could simply run the decom cookbook to remove it, that should do the trick (in theory)
[16:13:21] elukey: have a minute to sync in the batcave?
[16:15:04] razzi: sure, gimme 2 min and I'll join
[16:18:56] I am in :)
[16:21:33] elukey: finally done: https://gerrit.wikimedia.org/r/c/operations/puppet/+/642454
[16:21:40] nice! will review in a bit
[16:21:42] PCC is happy
[16:47:54] Amir1: merged! Thanks a lot
[16:48:30] Amir1: also not sure if you already met (virtually) our new SRE, razzi
[16:48:45] if Analytics things are exploding you can now ping 3 people :)
[16:48:51] (Andrew, Razzi and me)
[17:08:48] Nice to meet you!
[17:09:05] Thanks. I will try to bother more people now mwhaha
[17:09:39] :D
[17:13:49] 10Analytics, 10Gerrit, 10Release-Engineering-Team (Development services): Unable to clone git repo from stat1008 - https://phabricator.wikimedia.org/T268290 (10brennen) Yeah, I know that we [[https://wikitech.wikimedia.org/wiki/Production_access#Security | disallow agent forwarding]], at least by policy and...
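Related to the stat1008 → Gerrit thread above: a couple of quick checks that could be run from the stat host to confirm the firewall change before retrying the clone. The hosts and ports are the standard Wikimedia Gerrit ones (SSH on 29418, HTTPS on 443); the username in the SSH clone URL is a placeholder:

    # Is the Gerrit SSH port reachable from the stat host?
    nc -vz -w 5 gerrit.wikimedia.org 29418

    # Is HTTPS reachable (used by the anonymous clone)?
    curl -sI https://gerrit.wikimedia.org/r/ | head -n 1

    # Then retry the clone over SSH ("someuser" is a placeholder)
    git clone ssh://someuser@gerrit.wikimedia.org:29418/research/mwaddlink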
[17:19:33] 10Analytics, 10Data-Services, 10cloud-services-team (Kanban): Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema - https://phabricator.wikimedia.org/T215858 (10Jhernandez) From the upcoming WikiReplicas architecture changes we have cataloged T2...
[17:21:54] 10Analytics, 10Operations, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10JAnstee_WMF) 05Resolved→03Open
[17:21:57] 10Analytics, 10Operations, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10JAnstee_WMF) Reopened and added my Director, Sumeet (Sbodington), for approval
[17:26:11] 10Analytics, 10Gerrit, 10Release-Engineering-Team (Development services): Unable to clone git repo from stat1008 - https://phabricator.wikimedia.org/T268290 (10Dzahn) >>! In T268290#6636347, @kostajh wrote: > Weirdly (to me anyway) there is no SSH key visible in my home directory on the server SSH keys are...
[18:13:18] joal or anyone else: I'm still unsure of how to get access to sqoop, as detailed in my message yesterday. If anyone knows how I can get that access, please let me know
[18:13:30] Hi lex
[18:13:52] Hey!
[18:14:10] I can help
[18:14:33] lexnasser: Sqoop uses MySQL connections behind the scenes
[18:14:44] lexnasser: therefore you need a mysql user and password :)
[18:15:17] For labsdb (the one you gave as an example last night), the team has a dedicated user and password
[18:15:37] It is sensitive, but we have no easy way other than to share it among us
[18:28:19] Gone for dinner
[18:45:05] 10Analytics: Kerberos identity - https://phabricator.wikimedia.org/T268365 (10fkaelin)
[19:30:08] * elukey afk!
[19:42:49] 10Analytics, 10Data-Services, 10cloud-services-team (Kanban): Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema - https://phabricator.wikimedia.org/T215858 (10Legoktm) For reference: OLAP is https://en.wikipedia.org/wiki/Online_analytical_proc...
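To illustrate the point above that Sqoop is essentially a MySQL (JDBC) client under the hood, here is a hedged sketch of an import; every name below (host, database, table, credentials file, target directory) is a placeholder rather than the team's actual configuration:

    # Import one table from a MySQL replica into HDFS (all names are placeholders)
    sqoop import \
      --connect jdbc:mysql://labsdb-replica.example.wmnet/enwiki \
      --username analytics_sqoop \
      --password-file /user/analytics/.mysql-password \
      --table page \
      --target-dir /tmp/sqoop/enwiki_page \
      --num-mappers 1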
[21:29:09] uou forgot to change back to normal [21:30:31] heya teamm if anyone wants to have a peak at the eventstreams ui, I finished a first version, it's here: https://gerrit.wikimedia.org/r/c/mediawiki/services/eventstreams/+/642542 feedback welcome! [21:58:19] 10Analytics: Kerberos identity for fkaelin - https://phabricator.wikimedia.org/T268365 (10Peachey88)