[02:08:50] PROBLEM - Check the last execution of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:19:32] RECOVERY - Check the last execution of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:13:15] 10Analytics, 10Gerrit, 10Patch-For-Review, 10Release-Engineering-Team (Development services): Unable to clone git repo from stat1008 - https://phabricator.wikimedia.org/T268290 (10elukey) We have the following config on all stat nodes: ` elukey@stat1008:~$ cat /etc/gitconfig # vim: set ts=4 sw=4 et: # Thi...
[07:29:01] Hi lexnasser - too late I am, but I have an answer to your question :)
[07:38:22] Good morning!
[07:40:55] bonjour!
[07:45:41] elukey: would you give me a minute to talk about the wikidata job? There is something I don't understand :)
[07:46:01] yes sure, what do you mean?
[07:46:59] The coordinator's actions missing SLA are numbered 1 and 2 - to me it means the oozie job has been restarted
[07:47:16] And I don't understand why :(
[07:47:43] ah the last ones! Yes it was me, there were two failed schedules at the end of the coord
[07:47:57] did it do something wrong? I thought they failed from waiting too long
[07:48:03] I was about to reply
[07:48:25] (I checked them via hue.w.o, hue next is unusable sigh)
[07:49:09] I manage my way in hue-next now, but not easily :S
[07:49:25] I already opened all the issues on gh, but no answer
[07:49:28] elukey: I think that rerunning the failed ones would have been enough - it's ok nonetheless :)
[07:49:32] I am really depressed
[07:49:57] joal: yep yep I did re-run only the failed runs via "Rerun" on hue
[07:50:16] * joal sends wikilove to elukey
[07:50:33] Ok I'm lost :)
[07:50:44] there is
[07:50:44] https://github.com/cloudera/hue/issues/1386
[07:50:47] Trying to revamp
[07:50:52] https://github.com/cloudera/hue/issues/1373
[07:51:05] https://github.com/cloudera/hue/issues/1273
[07:51:11] https://github.com/cloudera/hue/issues/1272
[07:51:18] for the last two dan created a patch
[07:51:28] that I applied to our hue version, not yet upstream
[07:51:53] so joal to recap what I did, that may be wrong (almost for sure)
[07:52:16] 1) I checked previous alerts and saw Marcel's script, and used it (with a little tweak) to create the eqiad SUCCESS flags
[07:52:35] 2) in https://hue.wikimedia.org/oozie/list_oozie_coordinator/0001961-201103154415936-oozie-oozi-C/ at the bottom I noticed two failed runs, among the 4 listed, so I've re-run them
[07:52:54] and this generated the two last alerts IIUC
[07:53:03] Right
[07:53:07] ok I get it now
[07:53:21] the thing I didn't get was that instances 1 and 2 were the ones that had failed
[07:53:45] ah ok is it wrong? I can kill them if not needed
[07:53:48] You did nothing wrong elukey :)
[07:54:21] There might be a missing step in moving the SUCCESS files from the folder where they've been generated to the prod folders
[07:55:07] * elukey nods
[07:55:28] And, about the failed jobs, I think it's worth investigating: I'm gonna check folders of parent-data and generated data, to reconstruct a current state of affairs :)
[07:55:34] elukey: shall we do that together?
[07:56:22] joal: the script to generate _SUCCESS flags is still running, should we wait for it to finish?
[07:56:27] or do you want me to kill it?
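For context on the "Rerun" step mentioned above: the same re-run of failed coordinator actions can be done with the Oozie CLI instead of Hue. A minimal sketch, assuming the standard Oozie client is available on the host; the server URL is a placeholder, while the coordinator ID and action numbers are the ones from the conversation:

    # Point the CLI at the Oozie server (hypothetical host/port)
    export OOZIE_URL=http://an-coord1001.eqiad.wmnet:11000/oozie

    # Inspect the coordinator and identify the failed actions
    oozie job -info 0001961-201103154415936-oozie-oozi-C

    # Re-run only the failed action numbers (1 and 2 in this case)
    oozie job -rerun 0001961-201103154415936-oozie-oozi-C -action 1-2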
[07:56:58] elukey: checking dates for which we need the flags
[07:59:42] elukey: from what I see there are success flags for every hour of every day since September - We should be good
[07:59:57] elukey: problem must come from something else (but WHAT?)
[08:01:39] HAH!
[08:01:41] I get it
[08:01:57] I checked for _SUCCESS flags - they are present - _REFINED files are missing!
[08:02:24] And our job is waiting for _REFINED files
[08:02:54] right yes
[08:03:34] And from what I read in your email elukey you generate _SUCCESS files :)
[08:03:37] ok then I am stopping my script
[08:03:48] yes yes I didn't even check, followed what Marcel did, it is right
[08:04:41] ok, let's generate missing _REFINED flags :)
[08:05:08] I can fix my script if it is ok and re-run it
[08:05:17] the list of _REFINED is the hue one basically
[08:05:28] (filtered, uniqued etc..)
[08:05:32] elukey: works for me
[08:05:46] elukey: We might miss some historical flags but I think it's ok
[08:06:12] joal: I am a bit confused why the last time was _SUCCESS and this time _REFINED
[08:06:24] I think last time it didn't work either
[08:06:29] hahaha okok
[08:06:37] But we have not triple checked :)
[08:06:42] ok started
[08:06:47] * joal facepalm
[08:07:03] Thanks :)
[08:07:55] in the meantime, I am merging https://gerrit.wikimedia.org/r/c/operations/homer/public/+/642268 to fix gerrit firewall rules for the great analytics firewall
[08:08:03] if anything weird pops up lemme know :)
[08:08:28] Ack!
[08:09:07] joal: also I'd need to restart the hadoop masters later on (openjdk upgrades + node rack changes) - ok for you?
[08:09:15] yessir!
[08:14:44] 10Analytics, 10Gerrit, 10Release-Engineering-Team (Development services): Unable to clone git repo from stat1008 - https://phabricator.wikimedia.org/T268290 (10elukey) @brennen can you retry and see if it is fixed? :)
[08:30:51] actually I am going to kick off a roll restart of kafka jumbo first
[08:34:24] (mirror maker first, then kafka)
[08:46:52] \o/ I got a simple map-reduce gobblin job working :)
[08:52:19] wooowwww niceee
[08:53:19] Output data is really not formatted in a way we'd like and there is no time-related partitioning yet - But I overcame the build and dependency problems :)
[09:23:36] 10Analytics, 10Gerrit, 10Release-Engineering-Team (Development services): Unable to clone git repo from stat1008 - https://phabricator.wikimedia.org/T268290 (10kostajh) >>! In T268290#6635934, @elukey wrote: > @brennen can you retry and see if it is fixed? :) Hmm, now I get: ` kharlan@stat1008:~$ git clone...
[09:24:29] 10Analytics, 10Gerrit, 10Release-Engineering-Team (Development services): Unable to clone git repo from stat1008 - https://phabricator.wikimedia.org/T268290 (10kostajh) Oops, sorry the HTTPS clone was when I was experimenting with that instead of the SSH clone. With ssh clone, I still get a timeout: ` khar...
[09:30:25] 10Analytics, 10Gerrit, 10Release-Engineering-Team (Development services): Unable to clone git repo from stat1008 - https://phabricator.wikimedia.org/T268290 (10elukey) @kostajh sorry wrong ping in Phab :) You are definitely right, I added the IPs related to gerrit1001 and gerrit2001, but apparently it is no...
[10:06:04] 10Analytics, 10Gerrit, 10Patch-For-Review, 10Release-Engineering-Team (Development services): Unable to clone git repo from stat1008 - https://phabricator.wikimedia.org/T268290 (10elukey) @kostajh should be better now, can you retry?
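The _REFINED backfill discussed above boils down to creating empty marker files on HDFS for each hourly partition that was refined but lacks the flag. A rough sketch of what such a script could look like; the base path and date range are hypothetical placeholders (the real list of partitions was derived from the failed runs shown in Hue):

    # Hypothetical dataset path; adjust to the real refined dataset location
    BASE=/wmf/data/event/some_dataset/year=2020/month=11
    for day in $(seq -w 1 30); do
      for hour in $(seq -w 0 23); do
        dir="$BASE/day=$day/hour=$hour"
        # Only touch the flag if the partition exists and the flag is missing
        if hdfs dfs -test -d "$dir" && ! hdfs dfs -test -e "$dir/_REFINED"; then
          hdfs dfs -touchz "$dir/_REFINED"
        fi
      done
    done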
[10:29:20] 10Analytics, 10Gerrit, 10Release-Engineering-Team (Development services): Unable to clone git repo from stat1008 - https://phabricator.wikimedia.org/T268290 (10kostajh) >>! In T268290#6636219, @elukey wrote: > @kostajh should be better now, can you retry? Yes! Now I just need to figure out the key issue. `...
[11:27:29] 10Analytics, 10Gerrit, 10Release-Engineering-Team (Development services): Unable to clone git repo from stat1008 - https://phabricator.wikimedia.org/T268290 (10kostajh) OK, the anonymous HTTPS clone now works: ` lang=bash git clone "https://gerrit.wikimedia.org/r/research/mwaddlink" && (cd "mwaddlink" && m...
[11:39:29] going afk for lunch!!
[13:58:36] 10Analytics, 10Gerrit, 10Release-Engineering-Team (Development services): Unable to clone git repo from stat1008 - https://phabricator.wikimedia.org/T268290 (10elukey) >>! In T268290#6636347, @kostajh wrote: > OK, the anonymous HTTPS clone now works: > > ` lang=bash > git clone "https://gerrit.wikimedia.or...
[14:27:34] hadoop masters restarted
[14:27:54] they also picked up the new rack config (for the workers that we'll move next week)
[14:52:03] \o/
[15:55:51] elukey: hey, for when you have a bit of time: https://gerrit.wikimedia.org/r/c/operations/puppet/+/642454/
[15:56:00] straightforward, noop for prod
[15:57:31] Amir1: thanks a lot! Would it be too much to ask to also add types in the code review? Otherwise I can do it later on :)
[15:57:51] you mean typehinting?
[15:58:06] types to the class parameters
[15:58:16] yeah, let me do it
[15:58:29] thanks :)
[16:04:59] razzi: good morning :) are you already creating the zookeeper vm?
[16:05:26] hi elukey, yes, I saw your comment on gerrit just now as well
[16:05:49] I think the name you suggested is good; how do you rename a ganeti vm?
[16:06:23] yeah this is the issue, we need to kill it, and if it has already created the DNS etc.. it might need a decom step
[16:06:47] but if you already decided the naming with Andrew I am fine, it is not super important
[16:07:04] ah ok, the hard way I see :)
[16:07:23] we have a wiki page that dcops manages with the naming for servers, so they know who to contact if they have questions etc..
[16:07:29] lemme paste it in here
[16:08:28] https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions#Servers
[16:11:09] in theory when the VM creation is done, we could simply run the decom cookbook to remove it, that should do the trick (in theory)
[16:13:21] elukey: have a minute to sync in the batcave?
[16:15:04] razzi: sure, gimme 2 min and I'll join
[16:18:56] I am in :)
[16:21:33] elukey: finally done: https://gerrit.wikimedia.org/r/c/operations/puppet/+/642454
[16:21:40] nice! will review in a bit
[16:21:42] PCC is happy
[16:47:54] Amir1: merged! Thanks a lot
[16:48:30] Amir1: also not sure if you already met (virtually) our new SRE, razzi
[16:48:45] if Analytics things are exploding you can now ping 3 people :)
[16:48:51] (Andrew, Razzi and me)
[17:08:48] Nice to meet you!
[17:09:05] Thanks. I will try to bother more people now mwhaha
[17:09:39] :D
[17:13:49] 10Analytics, 10Gerrit, 10Release-Engineering-Team (Development services): Unable to clone git repo from stat1008 - https://phabricator.wikimedia.org/T268290 (10brennen) Yeah, I know that we [[https://wikitech.wikimedia.org/wiki/Production_access#Security | disallow agent forwarding]], at least by policy and...
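Related to the stat1008 → Gerrit thread above: a couple of quick checks that could be run from the stat host to confirm the firewall change before retrying the clone. The hosts and ports are the standard Wikimedia Gerrit ones (SSH on 29418, HTTPS on 443); the username in the SSH clone URL is a placeholder:

    # Is the Gerrit SSH port reachable from the stat host?
    nc -vz -w 5 gerrit.wikimedia.org 29418

    # Is HTTPS reachable (used by the anonymous clone)?
    curl -sI https://gerrit.wikimedia.org/r/ | head -n 1

    # Then retry the clone over SSH ("someuser" is a placeholder)
    git clone ssh://someuser@gerrit.wikimedia.org:29418/research/mwaddlink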
[17:19:33] 10Analytics, 10Data-Services, 10cloud-services-team (Kanban): Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema - https://phabricator.wikimedia.org/T215858 (10Jhernandez) From the upcoming WikiReplicas architecture changes we have cataloged T2...
[17:21:54] 10Analytics, 10Operations, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10JAnstee_WMF) 05Resolved→03Open
[17:21:57] 10Analytics, 10Operations, 10SRE-Access-Requests: Requesting access to production shell groups for JAnstee - https://phabricator.wikimedia.org/T266249 (10JAnstee_WMF) Reopened and added my Director, Sumeet (Sbodington), for approval
[17:26:11] 10Analytics, 10Gerrit, 10Release-Engineering-Team (Development services): Unable to clone git repo from stat1008 - https://phabricator.wikimedia.org/T268290 (10Dzahn) >>! In T268290#6636347, @kostajh wrote: > Weirdly (to me anyway) there is no SSH key visible in my home directory on the server SSH keys are...
[18:13:18] joal or anyone else: I'm still unsure of how to get access to sqoop, as detailed in my message yesterday. If anyone knows how I can get that access, please let me know
[18:13:30] Hi lex
[18:13:52] Hey!
[18:14:10] I can help
[18:14:33] lexnasser: Sqoop uses MySQL connections behind the scenes
[18:14:44] lexnasser: therefore you need a mysql user and password :)
[18:15:17] For labsdb (the one you gave as an example last night), the team has a dedicated user and password
[18:15:37] It is sensitive, but we have no easy way other than to share it among us
[18:28:19] Gone for dinner
[18:45:05] 10Analytics: Kerberos identity - https://phabricator.wikimedia.org/T268365 (10fkaelin)
[19:30:08] * elukey afk!
[19:42:49] 10Analytics, 10Data-Services, 10cloud-services-team (Kanban): Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema - https://phabricator.wikimedia.org/T215858 (10Legoktm) For reference: OLAP is https://en.wikipedia.org/wiki/Online_analytical_proc...
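To illustrate the point above that Sqoop is essentially a MySQL (JDBC) client under the hood, here is a hedged sketch of an import; every name below (host, database, table, credentials file, target directory) is a placeholder rather than the team's actual configuration:

    # Import one table from a MySQL replica into HDFS (all names are placeholders)
    sqoop import \
      --connect jdbc:mysql://labsdb-replica.example.wmnet/enwiki \
      --username analytics_sqoop \
      --password-file /user/analytics/.mysql-password \
      --table page \
      --target-dir /tmp/sqoop/enwiki_page \
      --num-mappers 1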
[21:29:09] uou forgot to change back to normal [21:30:31] heya teamm if anyone wants to have a peak at the eventstreams ui, I finished a first version, it's here: https://gerrit.wikimedia.org/r/c/mediawiki/services/eventstreams/+/642542 feedback welcome! [21:58:19] 10Analytics: Kerberos identity for fkaelin - https://phabricator.wikimedia.org/T268365 (10Peachey88)