[01:48:15] <icinga-wm>	 PROBLEM - Check the last execution of monitor_refine_event_failure_flags on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:09:29] <elukey>	 goood morning
[06:17:46] <elukey>	 lovely we crossed the 2PB mark on hdfs :(
[06:39:03] <elukey>	 !log manually ran "/usr/bin/find /srv/backup/hadoop/namenode -mtime +15 -delete" on an-master1002 to free some space in the backup partition
[06:39:04] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:39:58] <elukey>	 we keep around 20 days of hdfs fsimages (4.2G each nowadays), plus the lvs backup for the analytics-meta instance, that is around 50G
[06:40:16] <elukey>	 I need to finish the work on the new backup infra to remove this extra backup on the namenode
[06:41:40] <elukey>	 well back to HDFS, we crossed the 2PB mark
[06:41:51] <elukey>	 and some workers are showing partitions getting filled
[06:42:20] <elukey>	 I'll try to add asap those 6 nodes to increase space, but i am not ready yet, so some space needs to be freed
[06:45:19] <elukey>	 !log on an-launcher1002: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r -skipTrash /var/log/hadoop-yarn/apps/analytics-privatedata/logs/*
[06:45:20] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:47:38] <elukey>	 so I am checking the most used log dirs for yarn logs
[06:47:38] <elukey>	 elukey@an-launcher1002:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfs -ls  /var/log/hadoop-yarn/apps/mirrys/logs
[06:47:41] <elukey>	 Found 106 items
[06:47:44] <elukey>	 drwxrwx---   - mirrys mapred          0 2020-07-14 11:52 /var/log/hadoop-yarn/apps/mirrys/logs/application_1592377297555_131595
[06:47:47] <elukey>	 ...
[06:47:58] <elukey>	 given the timestamp, our 40 days drop config doesn't seem to work
[06:48:48] <elukey>	 on an-launcher1002: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r -skipTrash /var/log/hadoop-yarn/apps/mirrys/logs/*
[06:48:58] <elukey>	 !log on an-launcher1002: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r -skipTrash /var/log/hadoop-yarn/apps/mirrys/logs/*
[06:48:59] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:49:43] <elukey>	 dropped ~100T
[07:13:07] <wikibugs>	 10Analytics: Check home/HDFS leftovers of jkumarah - https://phabricator.wikimedia.org/T263715 (10MoritzMuehlenhoff)
[07:21:38] <klausman>	 elukey: Morning! Any last things I should be aware of before reimaging 1006?
[07:22:06] <elukey>	 klausman: morning! nope, please proceed
[07:29:43] <klausman>	 !log Starting reimaging of stat1006
[07:29:45] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:32:07] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban: Move the stat1004-6-7 hosts to Debian Buster - https://phabricator.wikimedia.org/T255028 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by klausman on cumin1001.eqiad.wmnet for hosts: ` ['stat1006.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-...
[07:39:13] <elukey>	 joal: bonjour - before stopping timers etc.. lemme know if it is a good time this morning to do the TLS maintenance for hadoop
[07:39:50] <elukey>	 we could in theory not even go into safe mode
[07:42:31] <klausman>	 elukey: btw, will puppet break this time around again? I found https://wikitech.wikimedia.org/wiki/Puppet#Reinstalls but it that seems to indicate that no manual intervention should *nromally* be necessary
[07:44:56] <elukey>	 klausman: I hope not, we need to see if the bug is fixed
[07:45:04] <klausman>	 Roger
[07:45:16] <klausman>	 First boot of new install is about to happen
[07:45:46] <elukey>	 super
[07:46:03] <klausman>	 and we're back
[07:46:57] <elukey>	 the first puppet runs will take a bit, but if the reimage script doesn't break it should mean that the old bug is fixed
[07:47:09] <klausman>	 Ack
[07:47:14] <elukey>	 there are still some issues here and there but should be ok
[07:48:03] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban: Move the stat1004-6-7 hosts to Debian Buster - https://phabricator.wikimedia.org/T255028 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['stat1006.eqiad.wmnet'] `  Of which those **FAILED**: ` ['stat1006.eqiad.wmnet'] `
[07:48:28] <klausman>	 07:47:51 | stat1006.eqiad.wmnet | Unable to run wmf-auto-reimage-host: Failed to puppet_first_run
[07:48:30] <klausman>	 Welp.
[07:49:06] <klausman>	 Was there anything to do beyond the steps on the wikipage I linked?
[07:50:01] <elukey>	 that part is taken care by wmf-auto-reimage, no need to execute those commands
[07:50:18] <klausman>	 But it said the first run failed?
[07:50:44] <elukey>	 yes but it is due to the puppet code, still that bug that I hoped to have fixed
[07:51:01] <klausman>	 I'm confused.
[07:51:39] <klausman>	 As I understood it, the first puppet run triggered by the install script should work.
[07:51:59] <klausman>	 But here, it didn't. So is there anything that needs doing?
[07:52:18] <elukey>	 yep yep with the assumption that the puppet code for the role works in every condition, like on a fresh node
[07:52:44] <elukey>	 so in theory, best case scenario, no puppet code bugs etc.. and the first puppet run works
[07:53:22] <elukey>	 often (and this is also a good consequence of reimaging) the puppet code has bugs and the first puppet run shows them
[07:53:29] <elukey>	 for example, let's take this one
[07:53:49] <elukey>	 if you go on puppetmaster1001 and use install_console, then /var/log/puppet.log should show
[07:53:57] <elukey>	 Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, Failed getting spark version via facter. (file: /etc/puppet/modules/profile/manifests/hadoop/spark2.pp, line: 101, column: 9) on node stat1006.eqiad.wmne
[07:54:16] <elukey>	 that is the same bug we had with 1004
[07:54:47] <elukey>	 so wmf-auto-reimage tried to kick off a puppet run, but the return code was non zero so it stopped
[07:55:05] <klausman>	 Ok. So the manual fix for 1006 is to install spark, so facter sees something, then do a puppet run?
[07:55:28] <elukey>	 yes sadly, since for some reason that code is not completely working
[07:55:43] <elukey>	 another alternative, that surely works, is to use hiera's lookup
[07:55:44] <elukey>	 (brb)
[07:55:50] <elukey>	 we could fix it now and see if it works
[07:56:01] <klausman>	 Btw, there is no puppet log on 1006
[07:58:34] <elukey>	 ah then just run puppet to see, I thought there was
[08:00:26] <klausman>	 It *is* the Spark failure
[08:00:26] <elukey>	 so if we want to unblock this now, we can just install spark2 via install_console, run puppet, and then reboot the host
[08:01:07] <klausman>	 How much work would testing a fix for the puppet role be?
[08:02:11] <elukey>	 if we add a hiera lookup, it is a puppet change that should take 10/15 mins, but we may want to follow up with andrew later on to figure out if he wants to fix the facter's code
[08:02:36] <elukey>	 in case he doesn't want, we can add the hiera lookup before 1007 and that's it
[08:03:05] <elukey>	 so manual fix for now, then puppet change before 1007 could be a good compromise
[08:03:08] <elukey>	 as you prefer
[08:03:30] <klausman>	 I agree with manual now, proper fix for 1007
[08:03:45] <klausman>	 will install spark2, run puppet and then reboot
[08:03:52] <elukey>	 okok
[08:04:12] <klausman>	 are the defaults ok? (regarding suggests: etc)
[08:06:06] <elukey>	 klausman: what do you mean?
[08:07:00] <klausman>	 Sometimes, one wants to install a package but not the Suggests: and Recommends: stuff. It's something like –no-install-recommends 
[08:10:05] <elukey>	 ahhh okok
[08:10:31] <elukey>	 I think spark2 is a big jar container, should be ok 
[08:11:08] <elukey>	 (it doesn't bring more packages in etc..)
[08:11:09] <klausman>	 Alrighty
[08:13:24] <klausman>	 Man Puppet sure doesn't *look* fast
[08:14:54] <elukey>	 we install a ton of things on those nodes :(
[08:18:35] <elukey>	 klausman: I forgot one thing, namely https://phabricator.wikimedia.org/T262609
[08:19:15] <elukey>	 so on all hosts puppet creates a new v6 interface with the last 64bits containing the v4 adddress mapped
[08:19:38] <elukey>	 on stat1004 we have for example inet6 2620:0:861:104:10:64:5:104/64 scope global mngtmpaddr dynamic
[08:19:50] <elukey>	 so we can put those as AAAA records etc..
[08:19:56] <elukey>	 (rather than relying on autoconfig)
[08:20:29] <klausman>	 So far, puppet doesn't seem *stalled*, just slow
[08:20:32] <elukey>	 the first puppet run might get stuck at the stage when the v6 interface is added/changed, since if it uses a v6 connection it breaks
[08:20:54] <klausman>	 I'll keep an eye on the run, and do the ctrl-c and restart thing mentioned by Moritz
[08:21:07] <klausman>	 (if it gets stuck, that is)
[08:21:09] <elukey>	 yeah but it might happen  that it gets stuck at that stage, if so either use the --source-address etc.. or ctrl+c etc.. as you said
[08:21:12] <elukey>	 perfect
[08:29:20] <elukey>	 klausman: I'd need to step away from keyboard for max 1h (I have to bring my car to do the bi-yearly check), ok if go? 
[08:32:16] <klausman>	 sure
[08:32:21] <elukey>	 super thanks, ttl :)
[08:41:34] <wikibugs>	 10Analytics, 10LDAP-Access-Requests, 10Operations: Grant access to archiva-deployers for mstyles - https://phabricator.wikimedia.org/T242624 (10hashar)
[08:41:42] <wikibugs>	 10Analytics, 10LDAP-Access-Requests, 10Operations: Grant access to archiva-deployers for zpapierski - https://phabricator.wikimedia.org/T242622 (10hashar)
[09:03:56] <klausman>	 stat1006 is back, and all clear mail has been sent.
[09:10:32] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban: Move the stat1004-6-7 hosts to Debian Buster - https://phabricator.wikimedia.org/T255028 (10klausman) Reimaging complete. The failure above is the failed first run of puppet due to no spark being installed. I did that manually, ran puppet rebooted for the kernel opts a...
[09:39:15] <elukey>	 klausman: back
[09:39:26] <klausman>	 welcome back.
[09:40:06] <klausman>	 Should we just nuke the 1004 backup on 1008 now, and the 1006 one on Monday? (I *think* the 1007 backup might fit if 1004 is gone
[09:41:03] <elukey>	 !log force re-creation of jupyterhub's default venv on stat1006 after reimage
[09:41:04] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:41:20] <elukey>	 klausman: yep please drop the 1004 backup
[09:41:39] <klausman>	 Running....
[09:42:08] <elukey>	 stat1006 looks good!
[09:42:29] <elukey>	 testing jupyterhub
[09:42:42] <klausman>	 excellent. one question: why was backing up to one of the labstore machines not considered for the /srv backups>
[09:44:23] <elukey>	 in theory those are not our hosts, they host different things
[09:44:39] <klausman>	 Alright.
[09:45:08] <elukey>	 there is also another option now that I think about. Joseph wrote a while ago a tool called hdfs-rsync, that mimics the rsync command but to/from hdfs
[09:47:18] <elukey>	 (tested a spark sql query via pyspark yarn notebook + kinit, all good on 1006)
[09:47:21] <klausman>	 I also just figured out that 1007 has 4.5T of data, but unless we delete the 1006 backup as well, we have no stat machine with enough space to backup that
[09:47:46] <elukey>	 I think that we can drop it just before starting the backup tomorrow
[09:48:23] <klausman>	 Yeah, makes sense.
[09:50:51] <elukey>	 I see that the convert_xml_to_parquet step of the mw-wikitext job is almost done, I may have a window to stop all the jobs and swap the tls certs
[09:51:06] <elukey>	 !log stop all timers on an-launcher1002 to ease maintenance
[09:51:07] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:51:12] <klausman>	 Just yell if want any help
[09:51:53] <elukey>	 yep yep it should be an easy procedure, but I have to do it in say ~1h due to cluster draining
[09:52:04] <elukey>	 I can explain what I am going to do if you are interested
[09:52:12] <klausman>	 Sure, I'll shoulder-surf
[09:52:20] <elukey>	 context is in https://phabricator.wikimedia.org/T253957
[09:52:55] <elukey>	 basically Hadoop has a wide variety of auth/encryption protocols, added over the years, so fun to manage all of those (</sarcasm>)
[09:53:12] <elukey>	 TLS is used for two big things
[09:53:33] <elukey>	 1) encrypt/auth traffic from the shufflers to the reducers (yarn)
[09:53:51] <elukey>	 2) encrypt/auth traffic from the HDFS Namenode to the journalnodes
[09:54:53] <elukey>	 we use a self signed CA and certs created via https://wikitech.wikimedia.org/wiki/Cergen currently, and instead of keep renewing the CA etc.. we thought to use puppet host level TLS certs
[09:55:04] <elukey>	 John in SRE did two things
[09:55:16] <elukey>	 1) add the puppet CA to the default truststores of all the JVMs
[09:55:44] <elukey>	 2) add some defines/classes to be able to wrap the puppet tls host's pem into a pkcs12 keystore
[09:56:02] <elukey>	 the maintenance is about swapping the config 
[09:56:37] <elukey>	 case 1), the shufflers, is easy since it just need a roll restart of the node managers (that can happen anytime without affecting jobs etc..), but the cluster is better to be drained first
[09:56:51] <elukey>	 the second case, hdfs journanodes, is a little bit more delicate
[09:57:11] <elukey>	 the journal nodes are meant to keep a replicated HDFS edit log
[09:57:36] <elukey>	 that is basically a stream of changes to HDFS, not yet packed/compacted into one fsimage
[09:57:42] <klausman>	 Ah, so just shooting them in the head is very ill-advised
[09:58:57] <elukey>	 we can roll restart them in theory, one at the time, and a mixed TLS certs cluster should be ok since comms happens only from namenode (an-master*) to journal daemons (so the journal daemons don't really have any form of consensus algorithm, it is all handled  by the namenode)
[09:59:18] <elukey>	 and the namenode now should trust the puppet CA
[09:59:37] <elukey>	 but we could play on the safe side and set something called "HDFS safe mode"
[09:59:44] <elukey>	 the fs goes in read only basically
[10:00:24] <klausman>	 Is that much slower/more work?
[10:00:38] <elukey>	 some seconds to enter/exit safe mode, it is one command
[10:01:44] <klausman>	 Then that sounds like a good idea
[10:02:01] <klausman>	 (very unrelatedly, just spotted something: systemd[1]: [/etc/systemd/system/user.slice.d/puppet-override.conf:7] Memory limit '0' out of range. Ignoring.)
[10:02:35] <klausman>	 That's MemorySwapMax=0
[10:04:12] <elukey>	 ah interesting, IIRC we set no swap for the cgroups containing regular users on stat100x, but not sure if really needed
[10:04:26] <klausman>	 it's not working, at any rate :)
[10:05:46] <elukey>	 I am wondering if it is not working on buster, but it did on stretch
[10:05:54] <elukey>	 anyway, I think it is ok to remove it
[10:06:03] <elukey>	 do you want to file the puppet change klausman ?
[10:07:00] <klausman>	 1007 does not have the message about the 0 limit
[10:07:07] <klausman>	 Yes, will do
[10:08:26] <elukey>	 thanks!
[10:12:01] <wikibugs>	 10Analytics: Drop MemorySwapMax=0 from analytics puppet roles - https://phabricator.wikimedia.org/T263731 (10klausman)
[10:15:39] <klausman>	 *sigh* including a line longer than 100 characters is a big nono, even if it is a verbatim log message...
[10:19:01] <klausman>	 And merged
[10:19:58] <elukey>	 good
[10:20:01] <klausman>	 Who would I poke about modules/profile/files/toolforge/bastion-user-resource-control.conf? They have the same setting we have which won't work on Buster
[10:20:09] <klausman>	 we ahd*
[10:20:14] <elukey>	 arturo on IRC is the best poc
[10:20:25] <klausman>	 Ok, will give him a shout
[10:23:00] <wikibugs>	 10Analytics: Drop MemorySwapMax=0 from analytics puppet roles - https://phabricator.wikimedia.org/T263731 (10klausman) 05Open→03Resolved
[11:00:05] <elukey>	 going afk for lunch, the timers are still disabled, waiting for the cluster to drain
[11:00:08] <elukey>	 (as FYI)
[11:00:35] <elukey>	 I didn't start any tls cert swap, so in case it is needed just re-enable puppet on an-launcher1002 and run it to restore all the jobs etc..
[11:00:38] * elukey lunch
[11:17:30] <joal>	 Hi team - I'm sorry I'm super late today
[11:17:47] <joal>	 Looking at the cluster drain now
[11:27:37] <joal>	 here is my status: ongoing jobs are the mediawiki-wikitext-history one (let's not worry about it, it's not yet done) - and the pageview-histoirical backfilling
[11:28:36] <joal>	 https://hue-next.wikimedia.org/hue/jobbrowser/#!id=0036183-200915132022208-oozie-oozi-C
[11:30:42] <joal>	 I think we're good to stop the cluster when you wish
[11:37:47] <wikibugs>	 10Analytics, 10Analytics-Kanban: Improve mediawiki-wikitext spark job repartitioning - https://phabricator.wikimedia.org/T263736 (10JAllemandou) a:03JAllemandou
[11:38:05] <wikibugs>	 (03PS1) 10Joal: Update MediawikiXMLDumpsConverter repartitioning [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/629659 (https://phabricator.wikimedia.org/T263736)
[11:52:04] <elukey>	 joal: o/
[11:52:08] <elukey>	 I am here if you are
[11:52:13] <joal>	 Hi elukey - I'm sorry for my lateness :S
[11:52:45] <elukey>	 nono please don't say that, there are plenty of things to do anyway as you well know :D
[11:53:12] <joal>	 elukey: looking a bit more in detail to mediawiki-wikitesxt job allowed me to pinpoint an easy performance improvement (see task and patch just above)
[11:53:26] <joal>	 elukey: we can start when you wish
[11:54:23] <elukey>	 ok prepping the change
[11:56:27] <joal>	 also I just noticed there are actually 2 backfilling jobs for pageview-historical, the one above and that one: 
[11:56:33] <joal>	 https://hue-next.wikimedia.org/hue/jobbrowser/#!id=0054362-200915132022208-oozie-oozi-C
[11:56:52] <klausman>	 elukey: what's the best way to follow along?
[11:57:53] <elukey>	 klausman: I can write what I am doing in here at every step if you are ok
[11:58:13] <klausman>	 Sure.
[11:58:20] <elukey>	 the change is very simple - https://gerrit.wikimedia.org/r/c/operations/puppet/+/629663/
[11:58:35] <elukey>	 I just executed from cumin 
[11:58:35] <elukey>	 sudo cumin 'c:profile::hadoop::common' 'disable-puppet "elukey - precaution for https://gerrit.wikimedia.org/r/c/operations/puppet/+/629663"
[11:58:45] <elukey>	 just to be on the safe side
[11:59:23] <elukey>	 joal: ah so we need to wait?
[12:00:24] <elukey>	 they are doing the hive step, maybe better to wait
[12:00:30] <elukey>	 can we pause them afterwards?
[12:00:59] <joal>	 no o, no need to wait
[12:01:11] <joal>	 when steps are done, new steps start right awa
[12:01:29] <joal>	 so let's not wait, I'll restart failed instances as needed
[12:01:37] <joal>	 elukey, klausman --^
[12:02:00] <klausman>	 aye aye
[12:02:26] <elukey>	 ack, prepping for the change
[12:04:19] <elukey>	 merged the puppet change
[12:04:32] <elukey>	 now I am going to test what happens on analytics1042
[12:04:55] <elukey>	 mmm no better to start from a host with journal nodes
[12:05:18] <elukey>	 klausman: we have the list of journal nodes in hieradata/common.yaml -> hadoop_analytics
[12:05:38] <elukey>	 sorry hadoop_clusters -> analytics-hadoop
[12:05:59] <elukey>	 I pick analytics1052.eqiad.wmnet as testbed, so I can run puppet and restart both yarn and journalnode daemons
[12:06:07] <elukey>	 that should not return to me any horrible error messages
[12:06:13] <elukey>	 if so, I'll rollback
[12:06:20] <klausman>	 *nod*
[12:06:59] * joal waits in wonder about possible horrible horror messages
[12:08:25] <klausman>	 in the distance, explosions and sirens
[12:11:44] <elukey>	 all right all daemons are good
[12:11:54] <elukey>	 (I also restarted the datanode just in case to double check)
[12:12:14] <joal>	 no horror messages from daemons - that's unusual :)
[12:12:53] <elukey>	 what we can do now, to rollback as early as possible if needed, is to keep going with journal nodes
[12:13:00] <elukey>	 I don't think we need safe mode on at this point
[12:13:13] <klausman>	 I defer to your experience in this matter :)
[12:13:38] <joal>	 +1 for journal nodes first elukey - ok for no safemode
[12:14:21] <elukey>	 supe
[12:14:24] <elukey>	 *super
[12:14:43] <elukey>	 so I am doing tail -f /var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master1001.log on an-master1001 to see what the hadoop hdfs namenode (active) thinks about what I am doing
[12:15:30] <elukey>	 in theory there is a 1:1 connection from the namenode to each journal daemon, and the namenode should already trust the puppet ca, so a mixed journalnode tls cluster should be ok (a lot of shoulds I know)
[12:16:39] <elukey>	 2020-09-24 12:15:51,721 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Remote journal 10.64.5.27:8485 failed to write txns 4756473190-4756473190. Will try to write to this JN again after the next log roll.
[12:16:44] <elukey>	 2020-09-24 12:16:29,896 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Restarting previously-stopped writes to 10.64.5.27:8485 in segment starting at txid 4756473490
[12:17:14] <elukey>	 tres bien
[12:17:35] <elukey>	 no TLS "OH NOOESSS THIS CERTIFICATE IS NOT GOOOD"
[12:18:40] <elukey>	 proceeding with another journal
[12:21:00] <elukey>	 restarted the third journal, majority of the qjm, if the namenode doesn't like this it will shutdown
[12:21:08] <elukey>	 but seems that we are good
[12:22:27] <elukey>	 proceeding with the other two journals
[12:22:44] <klausman>	 Is it just me or is that hdfs log spammy as hell
[12:23:53] <elukey>	 yeah I am doing | grep -i journal sorry
[12:24:04] <elukey>	 I added it after a bit to dump the spam
[12:24:39] <elukey>	 klausman: /var/log/hadoop-hdfs/hdfs-audit.log is very interesting on an-master1001, good to check while you wait for me :)
[12:28:46] <elukey>	 all right all journals done
[12:30:45] <joal>	 elukey: ongoing job doesn't see any problem so far :)
[12:31:56] <elukey>	 next step, yarn nodemanagers restart
[12:31:57] <elukey>	 sudo cumin -m async 'A:hadoop-worker and not A:hadoop-hdfs-journal' 'enable-puppet "elukey - precaution for https://gerrit.wikimedia.org/r/c/operations/puppet/+/629663"' 'rm /etc/hadoop/conf/ssl-client.xml' 'run-puppet-agent' 'systemctl restart hadoop-yarn-nodemanager'
[12:32:30] <elukey>	 the /etc/hadoop/conf/ssl-client.xml file is not needed anymore, and causes some confusion on some daemons (puppet doesn't deploy it anymore, but it doesn't remove it either)
[12:32:34] <elukey>	 in batch of say 5
[12:33:07] <elukey>	 started
[12:36:18] <elukey>	 20% of the hosts now
[12:36:31] <elukey>	 klausman: questions so far? Doubts about my mental sanity? 
[12:36:55] <elukey>	 (the latter is something that kormat often brings up so this is why I am asking)
[12:37:55] <klausman>	 Questions no. No doubts either (interpret the latter as you will :))
[12:38:27] <klausman>	 As for kormat and sanity. "It takes one to know one" or something
[12:39:35] <elukey>	 hahahaha
[12:41:48] <elukey>	 joal: as FYI I opened https://github.com/cloudera/hue/issues/1272 for Hue
[12:42:33] <joal>	 ack elukey
[12:42:59] <elukey>	 I'll ask to the team if somebody can debug that error and see if there is an easy fix
[12:43:13] <elukey>	 maybe milimetric  :)
[12:43:25] * elukey tries to deliberately nerd snipe DAn
[12:43:44] <joal>	 elukey: I wonder if it's not that there is no more green bar, but the red one exists it seems - no?
[12:44:09] <elukey>	 joal: I don't see the red bar either for failed jobs
[12:44:28] <elukey>	 I added two pics, the one on the top is hue-next (without the bar
[12:44:43] <elukey>	 Maybe I need to format the bug report in a better way
[12:45:01] <joal>	 elukey: Ah! makes sense
[12:45:19] <joal>	 elukey: it's me not having understood there were two pictures
[12:45:56] <elukey>	 I tried to separate those, you are right
[12:46:08] <elukey>	 all yarn nodemanagers restarted
[12:46:18] <joal>	 elukey: I found a way to access the info - using the 'task' tab - but it's less obvious
[12:46:26] <elukey>	 yeah :(
[12:46:39] <elukey>	 ok proceeding with HDFS Namenodes and Yarn Resource managers
[12:47:01] <elukey>	 klausman: there is only one master active at the time, and they establish who via zookeeper
[12:47:14] <elukey>	 elukey@an-master1001:~$ sudo -u hdfs kerberos-run-command hdfs yarn rmadmin -getServiceState an-master1001-eqiad-wmnet
[12:47:17] <elukey>	 active
[12:47:20] <elukey>	 elukey@an-master1001:~$ sudo -u hdfs kerberos-run-command hdfs yarn rmadmin -getServiceState an-master1002-eqiad-wmnet
[12:47:23] <elukey>	 standby
[12:47:30] <elukey>	 elukey@an-master1001:~$ sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
[12:47:33] <elukey>	 active
[12:47:35] <elukey>	 elukey@an-master1001:~$ sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet
[12:47:38] <elukey>	 standby
[12:47:44] <elukey>	 so first yarn, then hdfs statuses
[12:47:59] <elukey>	 I am going to apply the changes to the standby node, restart, and then do the failover
[12:48:08] <elukey>	 to in case something fires up, I'll failback
[12:48:14] <elukey>	 ok joal --^ ?
[12:49:51] <elukey>	 (proceeding)
[12:50:16] <elukey>	 namenode on an-master1002 restarted
[12:50:20] <joal>	 ok elukey 
[12:50:43] <elukey>	 klausman: it takes a bit for a namenode to stabilize, especially from the jvm metrics point of view, so I'll wait 5 mins
[12:51:17] <klausman>	 Ack.
[12:52:07] <elukey>	 the major pain for the jvm is loading all the inodes 
[12:52:12] <klausman>	 close to 47 million inodes. Not bad
[12:52:36] <klausman>	 Speaking of: https://twitter.com/danvet/status/1309057488554254337
[12:52:41] <joal>	 klausman: 1/3 of that is for almost empty-blocks
[12:52:51] <elukey>	 I asked to bump the ram on those nodes to 128G (+64), I think we'll have to increase the heap size of the hdfs namenode to 32G soon
[12:52:52] <klausman>	 No surprise there.
[13:00:14] <elukey>	 ok so the standby namenode is stabilized, going to failover
[13:00:30] <elukey>	 sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet
[13:02:56] <elukey>	 joal: at this point I'd complete the restart + failback, hdfs looks good from my perspective
[13:05:24] <elukey>	 (proceeding)
[13:08:14] <elukey>	 ahhh joal!
[13:08:49] <elukey>	 after restarting the mapreduce history server I noticed a ton of logs related to drops for /var/log/yarn/etc..
[13:08:52] <elukey>	 of course!
[13:09:08] <elukey>	 I didn't restart it when I applied the change for the 90->40 days
[13:09:13] <elukey>	 I restarted the yarn RMs!
[13:09:25] <elukey>	 so it is now dropping a lot of data
[13:09:32] <elukey>	 (logs)
[13:10:26] <klausman>	 Is it just me or does it feel good to get rid of old data that you are confident you never would've looked at anyway?
[13:10:27] <joal>	 elukey: Makes sense !!!!
[13:11:20] <elukey>	 klausman: oh yes it does, but I was confused why it didn't happen before, and I thought it didn't work!
[13:11:23] <elukey>	 now I feel better
[13:11:27] <joal>	 :)
[13:13:20] <elukey>	 I'll wait for the hdfs NN on 1001 to fully recover to failback
[13:18:58] <elukey>	 all right we are done
[13:19:46] <klausman>	 Ok, will update maint' page
[13:20:07] <klausman>	 and done
[13:20:13] <joal>	 elukey: wikitext job didn't even blink :)
[13:20:39] <elukey>	 !log re-enable timers on an-launcher1002 after maintenance
[13:20:41] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:21:29] <elukey>	 joal: all due to the magic of John - adding the puppet CA to the default truststore allowed the new certs of the shufflers to be trusted automatically
[13:21:53] * joal both loves and is afraid of magic
[13:24:15] <elukey>	 !log moved the hadoop cluster to puppet TLS certificates
[13:24:16] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:25:21] <Pchelolo>	 ottomata: what's out retention for jobs in kafka-main? 1 week?
[13:25:34] <ottomata>	 Pchelolo:  ya i think so
[13:25:39] <Pchelolo>	 oh dang.. 
[13:25:43] <ottomata>	 default is one week and i dont' think we've changed that
[13:26:02] <elukey>	 2M+ files deleted by the new log rules - https://grafana.wikimedia.org/d/000000585/hadoop?viewPanel=28&orgId=1
[13:26:38] <Pchelolo>	 can you bump it to 1 month for *.mediawiki.job.processMediaModeration ?
[13:26:51] <Pchelolo>	 I've accidentally lost my shell access to kafka boxes )
[13:27:24] <ottomata>	 sure
[13:28:17] <ottomata>	 Pchelolo:  the retry topics too?
[13:28:27] <Pchelolo>	 no, the retry doesn't matter
[13:28:31] <Pchelolo>	 only the main topics
[13:30:04] <milimetric>	 hahahaha, elukey what you did there is the equivalent of the bullet that fell from the sky in The Mexican
[13:30:19] <milimetric>	 because hue uses... (drum roll)... knockout!!!!
[13:30:29] <milimetric>	 what!?
[13:30:47] <milimetric>	 so yeah, I can fix this, but I may end up wanting to rewrite the whole thing
[13:32:04] <elukey>	 ahahahaha
[13:32:37] <ottomata>	 Pchelolo: 
[13:32:40] <ottomata>	 https://www.irccloud.com/pastebin/bd7ce4dE/
[13:32:42] <Pchelolo>	 thank you!
[13:32:54] <Pchelolo>	 that solves one little mystery
[13:33:10] <ottomata>	 Pchelolo:  what's the mystery?
[13:33:27] <Pchelolo>	 that the changeprop just stopped running
[13:33:30] <Pchelolo>	 for that job
[13:33:41] <Pchelolo>	 the answer is simple - the jobs vanished
[13:34:06] <Pchelolo>	 this one is used to submit a big chunk of events and then wait for them to process
[13:34:17] <Pchelolo>	 in this case - wait more then a week
[13:36:34] <ottomata>	 oh
[13:43:53] * klausman will be afk for 30m-45m, running an errand
[13:48:49] <wikibugs>	 10Analytics, 10Operations: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (10CDanis) >>! In T263496#6485312, @Ottomata wrote: >> The long-term answer (which might be stream processing stuff?) > is stream processing stuff >  >> In the very short ter...
[14:00:37] <milimetric>	 elukey: ok, you nailed it, that error prevents the bars from being set further down in that same function.  I think you're a JS developer now...
[14:01:39] * klausman back
[14:02:15] <klausman>	 Btw, the PSU error on kakfa-jumbo1008 should be gone, the DC people seated a wobbly power cable
[14:03:14] <ottomata>	 yay ty!
[14:03:24] <elukey>	 milimetric: :O :O
[14:03:27] <wikibugs>	 10Analytics, 10Operations: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (10JAllemandou) Question on the need for data @CDanis : Is the data augmentation needed in stream, or would refinement on the cluster be sufficient?
[14:06:51] <wikibugs>	 10Analytics, 10Operations: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (10CDanis) >>! In T263496#6491422, @JAllemandou wrote: > Question on the need for data @CDanis : Is the data augmentation needed in stream, or would refinement on the cluster...
[14:10:11] <milimetric>	 elukey: do you have a fork of this somewhere that I should do the PR from or should I do my own?
[14:13:06] <elukey>	 milimetric: I have a fork under my user in gh, but probably best if you just fork and send.. can we live test the fix on hue-next?
[14:13:44] <milimetric>	 sure, lemme make the commit, I can PR it to your fork if you want?
[14:14:43] <milimetric>	 holy crap do I have to do all this Review Board stuff? https://docs.gethue.com/developer/development/#setup
[14:16:01] <klausman>	 That's not exactly a low barrier of entry.
[14:16:17] <klausman>	 But it sounds like you might get away with a GH PR if the patch is simple enough
[14:16:59] <elukey>	 milimetric: nono just fork cloudera/hue on gh and send the pr
[14:17:12] <elukey>	 I already sent 5 of them
[14:17:30] <elukey>	 best if you PR from your repo, mine is a little messed up
[14:18:09] <milimetric>	 they've got a whole process there... it seems like we should follow it... well, anyway
[14:18:25] <milimetric>	 I meant I can send you the commit and we can live test with it (or you gotta tell me how to apply it)
[14:18:56] <klausman>	 The second sentence says: "For more complex patches it's advisable to use RB than a plain pull request on github." which kinda sounds like simple stuff is ok as GH PR
[14:19:01] <elukey>	 ah so I manually change files on an-tool1009, I know it is not pretty but quicker for small fixes
[14:22:07] <elukey>	 ottomata: o/ - https://gerrit.wikimedia.org/r/c/operations/puppet/+/629647/10/modules/profile/manifests/hadoop/common.pp - wdyt?
[14:22:51] <elukey>	 no regex.yaml, etc..
[14:23:22] <elukey>	 only one parameter, and we'd drop a ton of per-host configs (to change the partition list when a disk is broken for example)
[14:34:37] <ottomata>	 elukey:  whaajaaa????? cool wait factor partitions????
[14:34:38] <ottomata>	 looking
[14:35:10] <ottomata>	 elukey:  THAT IS SO COOL!
[14:35:11] <elukey>	 yeah1
[14:35:13] <elukey>	 !!!
[14:35:18] <ottomata>	 great idea
[14:35:35] <ottomata>	 add some good param docs for that explaining that it is filtering by that as mount path
[14:35:38] <ottomata>	 so the disk had better be mounted!
[14:35:41] <ottomata>	 but that is very cool
[14:35:47] <elukey>	 super doing so
[14:36:58] <elukey>	 this is nice https://puppet-compiler.wmflabs.org/compiler1003/25393/analytics1034.eqiad.wmnet/index.html
[14:37:13] <elukey>	 on 1034 (test cluster) I have probably not have kept track of all disks broken
[14:37:23] <elukey>	 and it is telling me that 2 are not mounted
[14:38:19] <elukey>	 but j seems to be there, mmm need to double check
[14:38:26] <elukey>	 the prod workers are all good afaics
[14:42:15] <ottomata>	 cool
[14:43:55] <elukey>	 ah no super weird
[14:43:55] <elukey>	   /dev/sdj1 => {
[14:43:55] <elukey>	     filesystem => "ext4",
[14:43:56] <elukey>	     mount => "/var/lib/hadoop/data/k",
[14:44:06] <elukey>	  /o\
[14:44:11] <elukey>	 this is on 1034
[14:44:15] <elukey>	 okok it is right
[14:44:17] <elukey>	 uff
[14:44:30] <elukey>	 the hadoop test is ... a little messed up
[14:55:55] <wikibugs>	 10Analytics-Radar, 10Epic, 10MW-1.35-notes (1.35.0-wmf.32; 2020-05-12), 10Platform Team Initiatives (Revision Storage Schema Improvements), 10Technical-Debt: Remove revision_comment_temp and revision_actor_temp - https://phabricator.wikimedia.org/T215466 (10Marostegui)
[14:57:58] <milimetric>	 elukey: ok, where on an-tool1009 is this thing?
[14:58:54] <elukey>	 all the files are under /usr/lib/hue
[14:59:03] <elukey>	 but I am not sure if you can modify those
[14:59:09] <elukey>	 if you gimme the diff I can try
[14:59:12] <elukey>	 and restart
[15:00:01] <wikibugs>	 (03CR) 10Mforns: [V: 03+2] "OK, I finished the suggested changes and tested this with real data." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628933 (https://phabricator.wikimedia.org/T263495) (owner: 10Mforns)
[15:00:55] <mforns>	 ottomata: I think this should fix the mediawiki_job deletion script ^ feel free to review, sorry it took me more than I thought
[15:01:37] <nuria>	 ping milimetric ottomata mforns fdans 
[15:02:19] <nuria>	 ping joal 
[15:02:36] <milimetric>	 mako changes should get picked up, let's see
[15:02:52] <elukey>	 I can force a restart
[15:02:55] <elukey>	 already changed?
[15:03:03] <milimetric>	 ah, not git
[15:03:11] <milimetric>	 no, I have a patch... I guess I can just edit
[15:03:24] <elukey>	 yeah it is a debian package deployed basically
[15:04:44] <wikibugs>	 10Analytics, 10Operations: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (10JAllemandou) > Currently these reports are going to Logstash; I don't think there's any refinement possible there? Not the refinement we do usually on the cluster indeed....
[15:10:14] <milimetric>	 elukey: those are all owned by root :( can you give me write permissions?
[15:12:01] <elukey>	 milimetric: what is the diff? I can apply it on the fly
[15:14:55] <milimetric>	 elukey: /home/milimetric/patch.diff on an-tool1009
[15:15:00] <elukey>	 super
[15:18:57] <wikibugs>	 10Analytics, 10Event-Platform: EventStreams error in logs: Error: Invalid number of arguments (for prometheus?) - https://phabricator.wikimedia.org/T263759 (10Ottomata)
[15:19:21] <wikibugs>	 10Analytics, 10Event-Platform: EventStreams error in logs: Error: Invalid number of arguments (for prometheus?) - https://phabricator.wikimedia.org/T263759 (10Ottomata)
[15:46:34] <icinga-wm>	 PROBLEM - HDFS corrupt blocks on an-master1001 is CRITICAL: 5 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen
[15:47:18] <elukey>	 5 is not problem, will check why
[15:47:23] <elukey>	 * no problem
[16:21:30] <klausman>	 now it's 9
[16:24:17] <elukey>	 very strange
[16:26:33] <elukey>	 so everytime it happened in the past it eventually auto-resolved
[16:26:54] <elukey>	 we can try to check/list the corrupt blocks on the namenode
[16:31:09] <ottomata>	 klausman: elukey razzi ops sync here?
[16:31:09] <ottomata>	 https://meet.google.com/ako-cdgs-mmw
[16:31:19] <elukey>	 yep coming sorry, got some water
[16:31:34] <ottomata>	 thirst is no excuse for tardiness, i'm marking this down in your records
[16:34:20] <wikibugs>	 (03PS8) 10Milimetric: [WIP] Add filter/split component to Wikistats TODO: I'm pushing this even though I haven't tested it in the UI at all, because it passes tests and I find that interesting, I want to see the difference between this and the eventually working version, to see if I miss adding any tests as I make it work.  Basically, this just tries to remove references to "breakdown" from everywhere. [analytics/wikistats2] 
[16:34:20] <wikibugs>	 (https://phabricator.wikimedia.org/T249758) (owner: 10Fdans)
[16:37:43] <wikibugs>	 (03PS9) 10Milimetric: Add filter/split component to Wikistats [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/613114 (https://phabricator.wikimedia.org/T249758) (owner: 10Fdans)
[16:43:56] <joal>	 Came accross that the other day - thought it would be worth sharing: https://backstage.io/
[16:49:01] <wikibugs>	 10Analytics, 10Analytics-Wikistats, 10I18n, 10good first task: Add link to translatewiki.net in wikistats footer - https://phabricator.wikimedia.org/T261502 (10Nuria) We need to deploy this change , see: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Wikistats_2#Contributing_and_Deployment
[16:49:44] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, 10I18n, 10good first task: Add link to translatewiki.net in wikistats footer - https://phabricator.wikimedia.org/T261502 (10Nuria) a:05paulkernfeld→03razzi
[16:49:59] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, 10I18n, 10good first task: Add link to translatewiki.net in wikistats footer - https://phabricator.wikimedia.org/T261502 (10Nuria) Assigning to razzi for deploy
[16:50:27] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, 10I18n, 10good first task: Add link to translatewiki.net in wikistats footer - https://phabricator.wikimedia.org/T261502 (10Nuria) cc @fdans so he knows this link was added
[17:28:31] <elukey>	 running a fsck / on hdfs to see if I get more info about those corrupt blocks
[17:30:43] <elukey>	 mforns: o/ - are the refine failures already checked/handled?
[17:31:00] <elukey>	 (asking since we have the usual icinga alert open for refine flags failures)
[17:31:01] <mforns>	 elukey: let me see
[17:31:29] <mforns>	 elukey:  no, I'll do, sorry
[17:32:01] <elukey>	 mforns: nono please don't say that, I was just checking alarms :)
[17:32:21] <mforns>	 yes, don't worry!
[17:32:21] <elukey>	 milimetric: the change is live on an-tool1009
[17:32:40] <elukey>	 mforns: ah also when you have a moment I can explain the oozie admin test that I have in mind
[17:32:51] <mforns>	 elukey: ok
[17:33:32] <milimetric>	 elukey: oh looks good: https://hue-next.wikimedia.org/hue/jobbrowser/jobs/0009826-200915132022208-oozie-oozi-W#!id=0009824-200915132022208-oozie-oozi-W
[17:33:48] <milimetric>	 (you get a bar, though I think it's wrong? and you get no more error)
[17:34:37] <milimetric>	 there's definitely a bunch of bugs on that page
[17:34:53] <milimetric>	 there are some random ghost arrows on the left floating around and the bottom two boxes are just nonsense
[17:35:12] <milimetric>	 that's more fundamental... I don't think the fix we did helps you at all, does it?
[17:35:13] <elukey>	 so I checked webrequest-load-wf-text-2020-9-21-21, that failed while doing add_partition
[17:35:25] <elukey>	 and it has a green bar at the end, but no red bar at the top
[17:35:40] <elukey>	 the error in the logs is gone though
[17:35:51] <milimetric>	 hm... maybe the status isn't error anymore
[17:36:22] <milimetric>	 I know while I was testing the thing it was failing on had a status of ERROR
[17:36:36] <milimetric>	 and when I applied the change manually, it set a red bar, but I don't remember where
[17:36:43] <elukey>	 I see another one in the console "POST https://hue-next.wikimedia.org/metadata/api/catalog/list_tags 500"
[17:36:55] <elukey>	 but I can't find traces of that in the logs of hue (that 500 I mean)
[17:38:12] <milimetric>	 yeah, it just looks like serious problems on this page
[17:38:20] <elukey>	 klausman, joal  - this is the fsck result https://phabricator.wikimedia.org/P12795 - so I think that the namenode's jmx metrics have a bug
[17:38:22] <milimetric>	 that 500 sounds like they're making a bad request
[17:38:48] <milimetric>	 I mean... I can't even click on any of the subworkflows, they throw an error
[17:39:10] <elukey>	 ah snap I didn't see those 400s
[17:39:12] <milimetric>	 I'm gonna abandon this one... maybe I'll just comment on the issue
[17:40:19] <elukey>	 on the logs I see 
[17:40:19] <elukey>	 desktop.lib.rest.http_client.RestException: 400 Client Error: Bad Request for url: http://an-coord1001.eqiad.wmnet:11000/oozie/v1/job?doAs=elukey&timezone=UTC
[17:40:27] <elukey>	 u>The request sent by the client was syntactically incorrect.</u></p><HR size="1" noshade="noshade"><h3>Apache Tomcat/6.0.53</h3></body></html> (error 400)
[17:40:38] <elukey>	 that is weird
[17:42:42] <elukey>	 ahhh and on the oozie side
[17:42:43] <elukey>	 java.lang.IllegalArgumentException: id cannot be empty
[17:43:22] <elukey>	 so I bet money that hue-next is tailored for oozie 5's api
[17:48:53] <elukey>	 ah yes I have a lead on the code, sigh
[17:57:46] <ottomata>	 mforns:  i see you are looking inot (or looked into) that refine error?
[17:57:50] <ottomata>	 did you find out what went wrwong?
[17:57:58] <ottomata>	 the error looks like maybe someone deleted the schema again
[17:58:07] <mforns>	 the CirrusSearch one?
[17:58:31] <ottomata>	 no CitationUsage
[17:59:06] <ottomata>	 yup https://meta.wikimedia.org/w/index.php?title=Schema:CitationUsage&action=history
[17:59:09] <ottomata>	 i'll add it to exclude list
[17:59:23] <mforns>	 ottomata: ok
[18:00:42] <elukey>	 milimetric: opened https://github.com/cloudera/hue/issues/1273
[18:01:47] <ottomata>	 mforns:  can you and razzi  finalize the deletion stuff, maybe next week?
[18:01:47] <ottomata>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/628895
[18:01:56] <ottomata>	 I don't feel comfortable merging that today
[18:02:05] <mforns>	 ottomata: sure no problem
[18:02:06] <ottomata>	 and i guess your script changes needs a refinery deploy anyway
[18:02:08] <ottomata>	 ya?
[18:02:17] <mforns>	 yes, I was going to do that today
[18:02:21] <ottomata>	 ok cool
[18:02:25] <mforns>	 but yea, for you, it's to hasty
[18:02:32] <ottomata>	 thank you!
[18:02:48] <mforns>	 when will you be back?
[18:02:58] <wikibugs>	 10Analytics, 10Analytics-Wikistats, 10Commons: Creating tools for compiling list of Wikimedia Commons users by contributions/upload - https://phabricator.wikimedia.org/T263377 (10SecretName101)
[18:03:00] <ottomata>	 you guys can get the checksum together, verify the delete will do the right thing, maybe do the first run manually, and razzi can merge puppet
[18:03:05] <ottomata>	 monday oct 12th
[18:03:18] <mforns>	 ok, yes, we can do that
[18:03:51] <wikibugs>	 10Analytics, 10Analytics-Wikistats, 10Commons: Creating tools for compiling list of Wikimedia Commons users by contributions/uploads - https://phabricator.wikimedia.org/T263377 (10SecretName101)
[18:04:45] <razzi>	 Cool. mforns: want to set aside time on Monday to do that? Apacheconf will be going on most of the rest of next week
[18:05:49] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban: Review and improve Oozie authorization permissions - https://phabricator.wikimedia.org/T262660 (10elukey) Testing idea: on analytics1030 (test cluster, where oozie runs) we have:  ` elukey@analytics1030:~$ sudo cat /etc/oozie/conf/adminusers.txt # Admin Users, one user...
[18:06:02] <elukey>	 mforns, razzi - https://phabricator.wikimedia.org/T262660#6492160
[18:06:12] <elukey>	 lemme know if it makes sense
[18:06:25] <mforns>	 lookin
[18:07:07] <mforns>	 elukey: yes, that makes total sense
[18:07:53] <elukey>	 should be very easy to test
[18:07:57] <mforns>	 we'll have to prepare a bit the terrain for the webrequest job to write its output to a separate place owned by analytics-privatedata
[18:08:13] <mforns>	 the job writes to 3 different tables
[18:08:23] <mforns>	 but yes, we'll manage
[18:08:24] <elukey>	 could also be any other job, not webrequest
[18:08:26] <elukey>	 even a simpler one
[18:08:35] <mforns>	 there's not a lot of data there
[18:08:53] <elukey>	 IIRC joseph a while launched the pageview one
[18:09:20] <elukey>	 but feel free to also kill/reuse the running webrequest job
[18:09:28] <elukey>	 I run it via as script in my home
[18:09:53] <elukey>	 mforns: /home/elukey/launch_webrequest_bundle.sh on an-tool1006
[18:09:56] <mforns>	 elukey: we already ran it yesterday
[18:10:25] <elukey>	 yeah my point is that you don't need to recreate different tables etc.. if you don't really want
[18:10:35] <elukey>	 you can use the existing ones, like kill the current coord etc..
[18:10:37] <mforns>	 but to run it under analytics-privatedata and not fail when writing data, we have to create a couple tables and setup a couple directories, not a big deal, but still todo
[18:10:38] <elukey>	 feel free to do it
[18:10:50] <elukey>	 mforns: why privatedata? You can use analytics no?
[18:11:07] <elukey>	 I mean, exactly what it is running now
[18:11:36] <mforns>	 oh I see...
[18:11:52] <elukey>	 I mean, I don't really care about those :D
[18:12:33] <mforns>	 hehe, I understand now, maaaan, I'm really bad in ops
[18:12:55] <elukey>	 you can kill it / restart it anytime, eventually I'll restart it just to have some continuity and catch bugs, but data holes are really not important in there
[18:13:18] <elukey>	 ahahahah nono it is something that I only use that it would require some "standardization"
[18:13:21] <mforns>	 ok, easier then! we can do this today if razzi agrees
[18:13:22] <elukey>	 my bad on this side
[18:13:45] <elukey>	 then if it works it is a matter of following up with others
[18:14:00] <mforns>	 yea, makes sense!
[18:14:02] <elukey>	 and eventually deploy the change in prod
[18:14:06] <elukey>	 super :)
[18:14:10] <mforns>	 :]
[18:14:38] <elukey>	 all right thanks! Logging off for today 
[18:14:40] <elukey>	 o/
[18:14:58] <mforns>	 oh razzi btw I missed your last message: yes we can work on the deletion script on monday
[18:18:23] <ottomata>	 nuria:  i made some suggested edits to your blog posts, looks great!
[18:18:31] <ottomata>	 feel free to ignore any or all of them :)
[18:20:34] <wikibugs>	 10Analytics, 10Platform Engineering, 10Epic, 10Platform Team Initiatives (API Gateway): AQS 2.0 - https://phabricator.wikimedia.org/T263489 (10Pchelolo)
[18:34:55] <razzi>	 mforns: I'm down to look at the oozie stuff today, let me know when works for you
[18:35:30] <mforns>	 hey razzi :] yes, now works for me, or else in 90 mins, what do you prefer?
[18:35:47] <razzi>	 mforns: Let's go for it now
[18:35:51] <mforns>	 k! omw
[18:36:25] <ottomata>	 :) <3
[18:58:33] <nuria>	 ottomata: super super thanks
[19:00:40] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10Operations, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10Nuria) assigning to @mforns
[19:29:54] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban: Review and improve Oozie authorization permissions - https://phabricator.wikimedia.org/T262660 (10razzi) Confirmed with @mforns that adding to the bundle.properties  `oozie.job.acl = <group that user belongs to>` (in this case wikidev)  Allows administering jobs via th...
[19:57:18] <wikibugs>	 10Analytics-Clusters, 10Operations, 10decommission-hardware: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (10RobH) I'm removing the #ops-eqiad tag, as this is hurting their open task metrics when its never actually been within their ability to move this forward.  When thi...
[20:05:51] <wikibugs>	 10Analytics, 10Event-Platform: Figure out where stream/schema annotations belong (for sanitization and other use cases) - https://phabricator.wikimedia.org/T263672 (10mforns) I think we've discussed this before, but just for the record: I think one important aspect of the sanitization config is that changes to...
[20:32:34] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban: Review and improve Oozie authorization permissions - https://phabricator.wikimedia.org/T262660 (10elukey) @razzi have you tried to add `analytics-privatedata-users` as `oozie.job.acl` and see if it works? In theory it should, IIUC the group membership is checked using...
[20:39:27] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban: Review and improve Oozie authorization permissions - https://phabricator.wikimedia.org/T262660 (10razzi) @elukey we did try that, and it didn't work. It's possible we misconfigured something; could give that another try.
[20:39:49] <wikibugs>	 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10SDAW-MediaSearch (MediaSearch-Beta), 10Structured-Data-Backlog (Current Work): [L] Instrument MediaSearch results page - https://phabricator.wikimedia.org/T258183 (10egardner) Thanks @Ottomata for your help in getting things working today...
[20:47:13] <wikibugs>	 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10SDAW-MediaSearch (MediaSearch-Beta), 10Structured-Data-Backlog (Current Work): [L] Instrument MediaSearch results page - https://phabricator.wikimedia.org/T258183 (10egardner) Thanks @Ottomata for your help in getting things working today...
[20:56:05] <wikibugs>	 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10SDAW-MediaSearch (MediaSearch-Beta), 10Structured-Data-Backlog (Current Work): [L] Instrument MediaSearch results page - https://phabricator.wikimedia.org/T258183 (10egardner) Also, since SearchSatisfaction is still a legacy schema, it ma...
[20:59:49] <wikibugs>	 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10SDAW-MediaSearch (MediaSearch-Beta), 10Structured-Data-Backlog (Current Work): [L] Instrument MediaSearch results page - https://phabricator.wikimedia.org/T258183 (10Mholloway) Hey @Ottomata (and @egardner!), I'm still catching up here bu...
[21:01:05] <wikibugs>	 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10SDAW-MediaSearch (MediaSearch-Beta), 10Structured-Data-Backlog (Current Work): [L] Instrument MediaSearch results page - https://phabricator.wikimedia.org/T258183 (10Mholloway) Oh, I think @jlinehan wanted to give it another looks as well.
[21:01:20] <wikibugs>	 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, 10Goal, and 3 others: Modern Event Platform - https://phabricator.wikimedia.org/T185233 (10srodlund)
[21:01:30] <wikibugs>	 10Analytics, 10Event-Platform, 10Technical-blog-posts: Story idea for Blog: Wikimedia's Event Platform - https://phabricator.wikimedia.org/T253649 (10srodlund) 05Open→03Resolved This is published! Note I went with an image of a stile from Commons as opposed to Flickr.  https://techblog.wikimedia.org/2020...
[21:01:56] <wikibugs>	 10Analytics, 10Analytics-Kanban, 10User-jijiki: Mechanism to flag webrequests as "debug" - https://phabricator.wikimedia.org/T263683 (10jijiki) @Milimetric that is fine, take your time and thank you!
[21:11:13] <wikibugs>	 10Analytics, 10Platform Engineering, 10Epic, 10Platform Engineering Roadmap, 10Platform Team Initiatives (API Gateway): AQS 2.0 - https://phabricator.wikimedia.org/T263489 (10Pchelolo)
[21:21:31] <wikibugs>	 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10SDAW-MediaSearch (MediaSearch-Beta), 10Structured-Data-Backlog (Current Work): [L] Instrument MediaSearch results page - https://phabricator.wikimedia.org/T258183 (10Ottomata) So, I'm disappointed that we need EventStreamConfig set up for...
[22:04:35] <wikibugs>	 10Analytics, 10Operations: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (10Nuria) @JAllemandou I think adding geo info (or rather swapping IP by Geo info ) is something that would need to happen in this case (in the absence of stream processing b...
[22:09:12] <wikibugs>	 (03CR) 10Nuria: [C: 03+2] Set timeout in oozie dumps-dependent jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/629073 (https://phabricator.wikimedia.org/T263529) (owner: 10Joal)
[22:09:15] <wikibugs>	 (03CR) 10Nuria: [V: 03+2 C: 03+2] Set timeout in oozie dumps-dependent jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/629073 (https://phabricator.wikimedia.org/T263529) (owner: 10Joal)
[22:24:06] <wikibugs>	 10Analytics, 10Event-Platform: Figure out where stream/schema annotations belong (for sanitization and other use cases) - https://phabricator.wikimedia.org/T263672 (10Nuria) +2 to @mforns comments
[23:46:02] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban, 10User-Elukey: Create temporary cluster to hold a copy of data for backup purposes - https://phabricator.wikimedia.org/T263814 (10Nuria)