[01:48:15] PROBLEM - Check the last execution of monitor_refine_event_failure_flags on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [06:09:29] goood morning [06:17:46] lovely we crossed the 2PB mark on hdfs :( [06:39:03] !log manually ran "/usr/bin/find /srv/backup/hadoop/namenode -mtime +15 -delete" on an-master1002 to free some space in the backup partition [06:39:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [06:39:58] we keep around 20 days of hdfs fsimages (4.2G each nowadays), plus the lvs backup for the analytics-meta instance, that is around 50G [06:40:16] I need to finish the work on the new backup infra to remove this extra backup on the namenode [06:41:40] well back to HDFS, we crossed the 2PB mark [06:41:51] and some workers are showing partitions getting filled [06:42:20] I'll try to add asap those 6 nodes to increase space, but i am not ready yet, so some space needs to be freed [06:45:19] !log on an-launcher1002: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r -skipTrash /var/log/hadoop-yarn/apps/analytics-privatedata/logs/* [06:45:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [06:47:38] so I am checking the most used log dirs for yarn logs [06:47:38] elukey@an-launcher1002:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfs -ls /var/log/hadoop-yarn/apps/mirrys/logs [06:47:41] Found 106 items [06:47:44] drwxrwx--- - mirrys mapred 0 2020-07-14 11:52 /var/log/hadoop-yarn/apps/mirrys/logs/application_1592377297555_131595 [06:47:47] ... 
[06:47:58] given the timestamp, our 40 days drop config doesn't seem to work [06:48:48] on an-launcher1002: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r -skipTrash /var/log/hadoop-yarn/apps/mirrys/logs/* [06:48:58] !log on an-launcher1002: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r -skipTrash /var/log/hadoop-yarn/apps/mirrys/logs/* [06:48:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [06:49:43] dropped ~100T [07:13:07] 10Analytics: Check home/HDFS leftovers of jkumarah - https://phabricator.wikimedia.org/T263715 (10MoritzMuehlenhoff) [07:21:38] elukey: Morning! Any last things I should be aware of before reimaging 1006? [07:22:06] klausman: morning! nope, please proceed [07:29:43] !log Starting reimaging of stat1006 [07:29:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:32:07] 10Analytics-Clusters, 10Analytics-Kanban: Move the stat1004-6-7 hosts to Debian Buster - https://phabricator.wikimedia.org/T255028 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by klausman on cumin1001.eqiad.wmnet for hosts: ` ['stat1006.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-... [07:39:13] joal: bonjour - before stopping timers etc.. lemme know if it is a good time this morning to do the TLS maintenance for hadoop [07:39:50] we could in theory not even go into safe mode [07:42:31] elukey: btw, will puppet break this time around again? 
I found https://wikitech.wikimedia.org/wiki/Puppet#Reinstalls but it seems to indicate that no manual intervention should *normally* be necessary [07:44:56] klausman: I hope not, we need to see if the bug is fixed [07:45:04] Roger [07:45:16] First boot of new install is about to happen [07:45:46] super [07:46:03] and we're back [07:46:57] the first puppet runs will take a bit, but if the reimage script doesn't break it should mean that the old bug is fixed [07:47:09] Ack [07:47:14] there are still some issues here and there but should be ok [07:48:03] 10Analytics-Clusters, 10Analytics-Kanban: Move the stat1004-6-7 hosts to Debian Buster - https://phabricator.wikimedia.org/T255028 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['stat1006.eqiad.wmnet'] ` Of which those **FAILED**: ` ['stat1006.eqiad.wmnet'] ` [07:48:28] 07:47:51 | stat1006.eqiad.wmnet | Unable to run wmf-auto-reimage-host: Failed to puppet_first_run [07:48:30] Welp. [07:49:06] Was there anything to do beyond the steps on the wikipage I linked? [07:50:01] that part is taken care of by wmf-auto-reimage, no need to execute those commands [07:50:18] But it said the first run failed? [07:50:44] yes but it is due to the puppet code, still that bug that I hoped to have fixed [07:51:01] I'm confused. [07:51:39] As I understood it, the first puppet run triggered by the install script should work. [07:51:59] But here, it didn't. So is there anything that needs doing? [07:52:18] yep yep with the assumption that the puppet code for the role works in every condition, like on a fresh node [07:52:44] so in theory, best case scenario, no puppet code bugs etc..
and the first puppet run works [07:53:22] often (and this is also a good consequence of reimaging) the puppet code has bugs and the first puppet run shows them [07:53:29] for example, let's take this one [07:53:49] if you go on puppetmaster1001 and use install_console, then /var/log/puppet.log should show [07:53:57] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, Failed getting spark version via facter. (file: /etc/puppet/modules/profile/manifests/hadoop/spark2.pp, line: 101, column: 9) on node stat1006.eqiad.wmne [07:54:16] that is the same bug we had with 1004 [07:54:47] so wmf-auto-reimage tried to kick off a puppet run, but the return code was non zero so it stopped [07:55:05] Ok. So the manual fix for 1006 is to install spark, so facter sees something, then do a puppet run? [07:55:28] yes sadly, since for some reason that code is not completely working [07:55:43] another alternative, that surely works, is to use hiera's lookup [07:55:44] (brb) [07:55:50] we could fix it now and see if it works [07:56:01] Btw, there is no puppet log on 1006 [07:58:34] ah then just run puppet to see, I thought there was [08:00:26] It *is* the Spark failure [08:00:26] so if we want to unblock this now, we can just install spark2 via install_console, run puppet, and then reboot the host [08:01:07] How much work would testing a fix for the puppet role be? 
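The failing fact quoted above shells out to spark to discover its version. As a hedged illustration only (the real fact lives in ruby under facter, and the banner string here is an assumed stand-in for `spark2-submit --version` output), the version-parsing step it performs looks roughly like:

```shell
# Stand-in for the facter fact failing in spark2.pp; the banner string is an
# assumption for illustration, not output captured from stat1006.
banner="Welcome to Spark version 2.4.4"

# extract the first x.y.z version token from the banner
spark_version=$(echo "$banner" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n1)

if [ -z "$spark_version" ]; then
  # mirrors the "Failed getting spark version via facter" error path: with no
  # spark2 package installed there is no banner, so the fact raises
  echo "Failed getting spark version via facter" >&2
  exit 1
fi

echo "spark_version=${spark_version}"
```

On a freshly reimaged host the spark2 package is absent, so the empty-banner branch fires during the first puppet run, which matches the catalog failure above.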
[08:02:11] if we add a hiera lookup, it is a puppet change that should take 10/15 mins, but we may want to follow up with andrew later on to figure out if he wants to fix the facter code [08:02:36] in case he doesn't want, we can add the hiera lookup before 1007 and that's it [08:03:05] so manual fix for now, then puppet change before 1007 could be a good compromise [08:03:08] as you prefer [08:03:30] I agree with manual now, proper fix for 1007 [08:03:45] will install spark2, run puppet and then reboot [08:03:52] okok [08:04:12] are the defaults ok? (regarding suggests: etc) [08:06:06] klausman: what do you mean? [08:07:00] Sometimes, one wants to install a package but not the Suggests: and Recommends: stuff. It's something like --no-install-recommends [08:10:05] ahhh okok [08:10:31] I think spark2 is a big jar container, should be ok [08:11:08] (it doesn't bring more packages in etc..) [08:11:09] Alrighty [08:13:24] Man Puppet sure doesn't *look* fast [08:14:54] we install a ton of things on those nodes :( [08:18:35] klausman: I forgot one thing, namely https://phabricator.wikimedia.org/T262609 [08:19:15] so on all hosts puppet creates a new v6 interface with the last 64 bits containing the v4 address mapped [08:19:38] on stat1004 we have for example inet6 2620:0:861:104:10:64:5:104/64 scope global mngtmpaddr dynamic [08:19:50] so we can put those as AAAA records etc.. [08:19:56] (rather than relying on autoconfig) [08:20:29] So far, puppet doesn't seem *stalled*, just slow [08:20:32] the first puppet run might get stuck at the stage when the v6 interface is added/changed, since if it uses a v6 connection it breaks [08:20:54] I'll keep an eye on the run, and do the ctrl-c and restart thing mentioned by Moritz [08:21:07] (if it gets stuck, that is) [08:21:09] yeah but it might happen that it gets stuck at that stage, if so either use the --source-address etc.. or ctrl+c etc..
as you said [08:21:12] perfect [08:29:20] klausman: I'd need to step away from keyboard for max 1h (I have to bring my car to do the bi-yearly check), ok if I go? [08:32:16] sure [08:32:21] super thanks, ttl :) [08:41:34] 10Analytics, 10LDAP-Access-Requests, 10Operations: Grant access to archiva-deployers for mstyles - https://phabricator.wikimedia.org/T242624 (10hashar) [08:41:42] 10Analytics, 10LDAP-Access-Requests, 10Operations: Grant access to archiva-deployers for zpapierski - https://phabricator.wikimedia.org/T242622 (10hashar) [09:03:56] stat1006 is back, and the all-clear mail has been sent. [09:10:32] 10Analytics-Clusters, 10Analytics-Kanban: Move the stat1004-6-7 hosts to Debian Buster - https://phabricator.wikimedia.org/T255028 (10klausman) Reimaging complete. The failure above is the failed first run of puppet due to no spark being installed. I did that manually, ran puppet rebooted for the kernel opts a... [09:39:15] klausman: back [09:39:26] welcome back. [09:40:06] Should we just nuke the 1004 backup on 1008 now, and the 1006 one on Monday? (I *think* the 1007 backup might fit if 1004 is gone) [09:41:03] !log force re-creation of jupyterhub's default venv on stat1006 after reimage [09:41:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:41:20] klausman: yep please drop the 1004 backup [09:41:39] Running.... [09:42:08] stat1006 looks good! [09:42:29] testing jupyterhub [09:42:42] excellent. one question: why was backing up to one of the labstore machines not considered for the /srv backups? [09:44:23] in theory those are not our hosts, they host different things [09:44:39] Alright. [09:45:08] there is also another option now that I think about it.
Joseph wrote a while ago a tool called hdfs-rsync, that mimics the rsync command but to/from hdfs [09:47:18] (tested a spark sql query via pyspark yarn notebook + kinit, all good on 1006) [09:47:21] I also just figured out that 1007 has 4.5T of data, but unless we delete the 1006 backup as well, we have no stat machine with enough space to backup that [09:47:46] I think that we can drop it just before starting the backup tomorrow [09:48:23] Yeah, makes sense. [09:50:51] I see that the convert_xml_to_parquet step of the mw-wikitext job is almost done, I may have a window to stop all the jobs and swap the tls certs [09:51:06] !log stop all timers on an-launcher1002 to ease maintenance [09:51:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:51:12] Just yell if want any help [09:51:53] yep yep it should be an easy procedure, but I have to do it in say ~1h due to cluster draining [09:52:04] I can explain what I am going to do if you are interested [09:52:12] Sure, I'll shoulder-surf [09:52:20] context is in https://phabricator.wikimedia.org/T253957 [09:52:55] basically Hadoop has a wide variety of auth/encryption protocols, added over the years, so fun to manage all of those () [09:53:12] TLS is used for two big things [09:53:33] 1) encrypt/auth traffic from the shufflers to the reducers (yarn) [09:53:51] 2) encrypt/auth traffic from the HDFS Namenode to the journalnodes [09:54:53] we use a self signed CA and certs created via https://wikitech.wikimedia.org/wiki/Cergen currently, and instead of keep renewing the CA etc.. 
we thought to use puppet host level TLS certs [09:55:04] John in SRE did two things [09:55:16] 1) add the puppet CA to the default truststores of all the JVMs [09:55:44] 2) add some defines/classes to be able to wrap the puppet tls host's pem into a pkcs12 keystore [09:56:02] the maintenance is about swapping the config [09:56:37] case 1), the shufflers, is easy since it just needs a roll restart of the node managers (that can happen anytime without affecting jobs etc..), but it is better to drain the cluster first [09:56:51] the second case, hdfs journalnodes, is a little bit more delicate [09:57:11] the journal nodes are meant to keep a replicated HDFS edit log [09:57:36] that is basically a stream of changes to HDFS, not yet packed/compacted into one fsimage [09:57:42] Ah, so just shooting them in the head is very ill-advised [09:58:57] we can roll restart them in theory, one at a time, and a mixed TLS certs cluster should be ok since comms happen only from namenode (an-master*) to journal daemons (so the journal daemons don't really have any form of consensus algorithm, it is all handled by the namenode) [09:59:18] and the namenode now should trust the puppet CA [09:59:37] but we could play it safe and set something called "HDFS safe mode" [09:59:44] the fs basically goes read-only [10:00:24] Is that much slower/more work? [10:00:38] some seconds to enter/exit safe mode, it is one command [10:01:44] Then that sounds like a good idea [10:02:01] (very unrelatedly, just spotted something: systemd[1]: [/etc/systemd/system/user.slice.d/puppet-override.conf:7] Memory limit '0' out of range. Ignoring.)
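For context, the systemd message points at a drop-in for the user slice. The only line confirmed by the log is the swap limit; everything else below is a reconstructed, hypothetical sketch of what a puppet-managed `puppet-override.conf` of this shape might contain, not a copy from the host:

```ini
# hypothetical /etc/systemd/system/user.slice.d/puppet-override.conf
# (only MemorySwapMax=0 is attested in the discussion; the rest is assumed)
[Slice]
MemorySwapMax=0
```

On buster's systemd this is the line being rejected as "Memory limit '0' out of range" and ignored.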
[10:02:35] That's MemorySwapMax=0 [10:04:12] ah interesting, IIRC we set no swap for the cgroups containing regular users on stat100x, but not sure if really needed [10:04:26] it's not working, at any rate :) [10:05:46] I am wondering if it is not working on buster, but it did on stretch [10:05:54] anyway, I think it is ok to remove it [10:06:03] do you want to file the puppet change klausman ? [10:07:00] 1007 does not have the message about the 0 limit [10:07:07] Yes, will do [10:08:26] thanks! [10:12:01] 10Analytics: Drop MemorySwapMax=0 from analytics puppet roles - https://phabricator.wikimedia.org/T263731 (10klausman) [10:15:39] *sigh* including a line longer than 100 characters is a big nono, even if it is a verbatim log message... [10:19:01] And merged [10:19:58] good [10:20:01] Who would I poke about modules/profile/files/toolforge/bastion-user-resource-control.conf? They have the same setting we had, which won't work on Buster [10:20:14] arturo on IRC is the best poc [10:20:25] Ok, will give him a shout [10:23:00] 10Analytics: Drop MemorySwapMax=0 from analytics puppet roles - https://phabricator.wikimedia.org/T263731 (10klausman) 05Open→03Resolved [11:00:05] going afk for lunch, the timers are still disabled, waiting for the cluster to drain [11:00:08] (as FYI) [11:00:35] I didn't start any tls cert swap, so in case it is needed just re-enable puppet on an-launcher1002 and run it to restore all the jobs etc..
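The pkcs12 wrapping mentioned earlier (John's defines that turn the puppet host pem into a JVM-loadable keystore) can be sketched with plain openssl. This is a hedged illustration, not the actual puppet code: paths, the keystore alias, and the password are assumptions, and a throwaway self-signed pair stands in for the host's puppet TLS key/cert:

```shell
# generate a throwaway key/cert pair as a stand-in for the puppet host pem files
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=analytics1052.eqiad.wmnet" \
  -keyout /tmp/host.key -out /tmp/host.crt 2>/dev/null

# wrap the pem pair into a pkcs12 keystore that a JVM keystore loader can read
openssl pkcs12 -export -in /tmp/host.crt -inkey /tmp/host.key \
  -name hadoop -passout pass:changeit -out /tmp/host.p12
```

Because step 1 of the maintenance added the puppet CA to the JVMs' default truststores, daemons presenting a keystore built this way are trusted without touching the cergen-era self-signed CA.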
* elukey lunch [11:17:30] Hi team - I'm sorry I'm super late today [11:17:47] Looking at the cluster drain now [11:27:37] here is my status: ongoing jobs are the mediawiki-wikitext-history one (let's not worry about it, it's not yet done) - and the pageview-historical backfilling [11:28:36] https://hue-next.wikimedia.org/hue/jobbrowser/#!id=0036183-200915132022208-oozie-oozi-C [11:30:42] I think we're good to stop the cluster when you wish [11:37:47] 10Analytics, 10Analytics-Kanban: Improve mediawiki-wikitext spark job repartitioning - https://phabricator.wikimedia.org/T263736 (10JAllemandou) a:03JAllemandou [11:38:05] (03PS1) 10Joal: Update MediawikiXMLDumpsConverter repartitioning [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/629659 (https://phabricator.wikimedia.org/T263736) [11:52:04] joal: o/ [11:52:08] I am here if you are [11:52:13] Hi elukey - I'm sorry for my lateness :S [11:52:45] nono please don't say that, there are plenty of things to do anyway as you well know :D [11:53:12] elukey: looking a bit more in detail at the mediawiki-wikitext job allowed me to pinpoint an easy performance improvement (see task and patch just above) [11:53:26] elukey: we can start when you wish [11:54:23] ok prepping the change [11:56:27] also I just noticed there are actually 2 backfilling jobs for pageview-historical, the one above and that one: [11:56:33] https://hue-next.wikimedia.org/hue/jobbrowser/#!id=0054362-200915132022208-oozie-oozi-C [11:56:52] elukey: what's the best way to follow along? [11:57:53] klausman: I can write what I am doing in here at every step if you are ok [11:58:13] Sure.
[11:58:20] the change is very simple - https://gerrit.wikimedia.org/r/c/operations/puppet/+/629663/ [11:58:35] I just executed from cumin [11:58:35] sudo cumin 'c:profile::hadoop::common' 'disable-puppet "elukey - precaution for https://gerrit.wikimedia.org/r/c/operations/puppet/+/629663"' [11:58:45] just to be on the safe side [11:59:23] joal: ah so we need to wait? [12:00:24] they are doing the hive step, maybe better to wait [12:00:30] can we pause them afterwards? [12:00:59] no no, no need to wait [12:01:11] when steps are done, new steps start right away [12:01:29] so let's not wait, I'll restart failed instances as needed [12:01:37] elukey, klausman --^ [12:02:00] aye aye [12:02:26] ack, prepping for the change [12:04:19] merged the puppet change [12:04:32] now I am going to test what happens on analytics1042 [12:04:55] mmm no better to start from a host with journal nodes [12:05:18] klausman: we have the list of journal nodes in hieradata/common.yaml -> hadoop_analytics [12:05:38] sorry hadoop_clusters -> analytics-hadoop [12:05:59] I pick analytics1052.eqiad.wmnet as testbed, so I can run puppet and restart both yarn and journalnode daemons [12:06:07] that should not return to me any horrible error messages [12:06:13] if it does, I'll rollback [12:06:20] *nod* [12:06:59] * joal waits in wonder about possible horrible horror messages [12:08:25] in the distance, explosions and sirens [12:11:44] all right all daemons are good [12:11:54] (I also restarted the datanode just in case to double check) [12:12:14] no horror messages from daemons - that's unusual :) [12:12:53] what we can do now, to rollback as early as possible if needed, is to keep going with journal nodes [12:13:00] I don't think we need safe mode on at this point [12:13:13] I defer to your experience in this matter :) [12:13:38] +1 for journal nodes first elukey - ok for no safemode [12:14:21] super [12:14:43] so I am doing tail -f
/var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master1001.log on an-master1001 to see what the hadoop hdfs namenode (active) thinks about what I am doing [12:15:30] in theory there is a 1:1 connection from the namenode to each journal daemon, and the namenode should already trust the puppet ca, so a mixed journalnode tls cluster should be ok (a lot of shoulds I know) [12:16:39] 2020-09-24 12:15:51,721 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Remote journal 10.64.5.27:8485 failed to write txns 4756473190-4756473190. Will try to write to this JN again after the next log roll. [12:16:44] 2020-09-24 12:16:29,896 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Restarting previously-stopped writes to 10.64.5.27:8485 in segment starting at txid 4756473490 [12:17:14] tres bien [12:17:35] no TLS "OH NOOESSS THIS CERTIFICATE IS NOT GOOOD" [12:18:40] proceeding with another journal [12:21:00] restarted the third journal, majority of the qjm, if the namenode doesn't like this it will shutdown [12:21:08] but seems that we are good [12:22:27] proceeding with the other two journals [12:22:44] Is it just me or is that hdfs log spammy as hell [12:23:53] yeah I am doing | grep -i journal sorry [12:24:04] I added it after a bit to dump the spam [12:24:39] klausman: /var/log/hadoop-hdfs/hdfs-audit.log is very interesting on an-master1001, good to check while you wait for me :) [12:28:46] all right all journals done [12:30:45] elukey: ongoing job doesn't see any problem so far :) [12:31:56] next step, yarn nodemanagers restart [12:31:57] sudo cumin -m async 'A:hadoop-worker and not A:hadoop-hdfs-journal' 'enable-puppet "elukey - precaution for https://gerrit.wikimedia.org/r/c/operations/puppet/+/629663"' 'rm /etc/hadoop/conf/ssl-client.xml' 'run-puppet-agent' 'systemctl restart hadoop-yarn-nodemanager' [12:32:30] the /etc/hadoop/conf/ssl-client.xml file is not needed anymore, and causes some confusion on some daemons (puppet doesn't deploy it 
anymore, but it doesn't remove it either) [12:32:34] in batch of say 5 [12:33:07] started [12:36:18] 20% of the hosts now [12:36:31] klausman: questions so far? Doubts about my mental sanity? [12:36:55] (the latter is something that kormat often brings up so this is why I am asking) [12:37:55] Questions no. No doubts either (interpret the latter as you will :)) [12:38:27] As for kormat and sanity. "It takes one to know one" or something [12:39:35] hahahaha [12:41:48] joal: as FYI I opened https://github.com/cloudera/hue/issues/1272 for Hue [12:42:33] ack elukey [12:42:59] I'll ask to the team if somebody can debug that error and see if there is an easy fix [12:43:13] maybe milimetric :) [12:43:25] * elukey tries to deliberately nerd snipe DAn [12:43:44] elukey: I wonder if it's not that there is no more green bar, but the red one exists it seems - no? [12:44:09] joal: I don't see the red bar either for failed jobs [12:44:28] I added two pics, the one on the top is hue-next (without the bar [12:44:43] Maybe I need to format the bug report in a better way [12:45:01] elukey: Ah! 
makes sense [12:45:19] elukey: it's me not having understood there were two pictures [12:45:56] I tried to separate those, you are right [12:46:08] all yarn nodemanagers restarted [12:46:18] elukey: I found a way to access the info - using the 'task' tab - but it's less obvious [12:46:26] yeah :( [12:46:39] ok proceeding with HDFS Namenodes and Yarn Resource managers [12:47:01] klausman: there is only one master active at a time, and they establish which one via zookeeper [12:47:14] elukey@an-master1001:~$ sudo -u hdfs kerberos-run-command hdfs yarn rmadmin -getServiceState an-master1001-eqiad-wmnet [12:47:17] active [12:47:20] elukey@an-master1001:~$ sudo -u hdfs kerberos-run-command hdfs yarn rmadmin -getServiceState an-master1002-eqiad-wmnet [12:47:23] standby [12:47:30] elukey@an-master1001:~$ sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet [12:47:33] active [12:47:35] elukey@an-master1001:~$ sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet [12:47:38] standby [12:47:44] so first yarn, then hdfs statuses [12:47:59] I am going to apply the changes to the standby node, restart, and then do the failover [12:48:08] so in case something fires up, I'll fail back [12:48:14] ok joal --^ ? [12:49:51] (proceeding) [12:50:16] namenode on an-master1002 restarted [12:50:20] ok elukey [12:50:43] klausman: it takes a bit for a namenode to stabilize, especially from the jvm metrics point of view, so I'll wait 5 mins [12:51:17] Ack. [12:52:07] the major pain for the jvm is loading all the inodes [12:52:12] close to 47 million inodes. Not bad [12:52:36] Speaking of: https://twitter.com/danvet/status/1309057488554254337 [12:52:41] klausman: 1/3 of that is for almost-empty blocks [12:52:51] I asked to bump the ram on those nodes to 128G (+64), I think we'll have to increase the heap size of the hdfs namenode to 32G soon [12:52:52] No surprise there.
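As an aside on the 40-day drop config mentioned this morning: aggregated yarn application logs are expired via the `yarn.log-aggregation.retain-seconds` property (property name from upstream hadoop docs; the exact yarn-site.xml on these hosts is not shown in the log, so this is just the value computation):

```shell
# 40-day retention for aggregated yarn logs, expressed as the value that
# yarn.log-aggregation.retain-seconds would carry in yarn-site.xml
RETAIN_DAYS=40
RETAIN_SECONDS=$((RETAIN_DAYS * 24 * 60 * 60))
echo "yarn.log-aggregation.retain-seconds=${RETAIN_SECONDS}"
```

The deletion itself is driven by the mapreduce history server, which is why restarting only the RMs leaves old logs in place.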
[13:00:14] ok so the standby namenode is stabilized, going to fail over [13:00:30] sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet [13:02:56] joal: at this point I'd complete the restart + failback, hdfs looks good from my perspective [13:05:24] (proceeding) [13:08:14] ahhh joal! [13:08:49] after restarting the mapreduce history server I noticed a ton of logs related to drops for /var/log/yarn/etc.. [13:08:52] of course! [13:09:08] I didn't restart it when I applied the change for the 90->40 days [13:09:13] I restarted the yarn RMs! [13:09:25] so it is now dropping a lot of data [13:09:32] (logs) [13:10:26] Is it just me or does it feel good to get rid of old data that you are confident you never would've looked at anyway? [13:10:27] elukey: Makes sense !!!! [13:11:20] klausman: oh yes it does, but I was confused why it didn't happen before, and I thought it didn't work! [13:11:23] now I feel better [13:11:27] :) [13:13:20] I'll wait for the hdfs NN on 1001 to fully recover before failing back [13:18:58] all right we are done [13:19:46] Ok, will update maint' page [13:20:07] and done [13:20:13] elukey: wikitext job didn't even blink :) [13:20:39] !log re-enable timers on an-launcher1002 after maintenance [13:20:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:21:29] joal: all due to the magic of John - adding the puppet CA to the default truststore allowed the new certs of the shufflers to be trusted automatically [13:21:53] * joal both loves and is afraid of magic [13:24:15] !log moved the hadoop cluster to puppet TLS certificates [13:24:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:25:21] ottomata: what's our retention for jobs in kafka-main? 1 week? [13:25:34] Pchelolo: ya i think so [13:25:39] oh dang..
default is one week and i don't think we've changed that [13:26:02] 2M+ files deleted by the new log rules - https://grafana.wikimedia.org/d/000000585/hadoop?viewPanel=28&orgId=1 [13:26:38] can you bump it to 1 month for *.mediawiki.job.processMediaModeration ? [13:26:51] I've accidentally lost my shell access to kafka boxes ) [13:27:24] sure [13:28:17] Pchelolo: the retry topics too? [13:28:27] no, the retry doesn't matter [13:28:31] only the main topics [13:30:04] hahahaha, elukey what you did there is the equivalent of the bullet that fell from the sky in The Mexican [13:30:19] because hue uses... (drum roll)... knockout!!!! [13:30:29] what!? [13:30:47] so yeah, I can fix this, but I may end up wanting to rewrite the whole thing [13:32:04] ahahahaha [13:32:37] Pchelolo: [13:32:40] https://www.irccloud.com/pastebin/bd7ce4dE/ [13:32:42] thank you! [13:32:54] that solves one little mystery [13:33:10] Pchelolo: what's the mystery? [13:33:27] that changeprop just stopped running [13:33:30] for that job [13:33:41] the answer is simple - the jobs vanished [13:34:06] this one is used to submit a big chunk of events and then wait for them to process [13:34:17] in this case - wait more than a week [13:36:34] oh [13:43:53] * klausman will be afk for 30m-45m, running an errand [13:48:49] 10Analytics, 10Operations: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (10CDanis) >>! In T263496#6485312, @Ottomata wrote: >> The long-term answer (which might be stream processing stuff?) > is stream processing stuff > >> In the very short ter... [14:00:37] elukey: ok, you nailed it, that error prevents the bars from being set further down in that same function. I think you're a JS developer now... [14:01:39] * klausman back [14:02:15] Btw, the PSU error on kafka-jumbo1008 should be gone, the DC people re-seated a wobbly power cable [14:03:14] yay ty!
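The actual command ottomata ran is behind the elided irccloud paste above, so the following is only a hedged sketch of the usual per-topic override with kafka-configs.sh: the topic name comes from the log (with an assumed `eqiad.` DC prefix, since `--entity-name` takes a concrete topic rather than the `*.` wildcard Pchelolo used), and the zookeeper address is a placeholder:

```shell
# ~1 month of retention, in milliseconds, as a per-topic retention.ms override
RETENTION_MS=$((30 * 24 * 60 * 60 * 1000))

# printed rather than executed here, since there is no kafka-main to talk to;
# the invocation shape follows the standard kafka-configs.sh topic override
echo "kafka-configs.sh --zookeeper localhost:2181 --alter \
  --entity-type topics --entity-name eqiad.mediawiki.job.processMediaModeration \
  --add-config retention.ms=${RETENTION_MS}"
```

A per-topic `retention.ms` overrides the broker-wide `log.retention.hours` default (the one-week default discussed above) for just that topic.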
[14:03:24] milimetric: :O :O [14:03:27] 10Analytics, 10Operations: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (10JAllemandou) Question on the need for data @CDanis : Is the data augmentation needed in stream, or would refinement on the cluster be sufficient? [14:06:51] 10Analytics, 10Operations: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (10CDanis) >>! In T263496#6491422, @JAllemandou wrote: > Question on the need for data @CDanis : Is the data augmentation needed in stream, or would refinement on the cluster... [14:10:11] elukey: do you have a fork of this somewhere that I should do the PR from or should I do my own? [14:13:06] milimetric: I have a fork under my user in gh, but probably best if you just fork and send.. can we live test the fix on hue-next? [14:13:44] sure, lemme make the commit, I can PR it to your fork if you want? [14:14:43] holy crap do I have to do all this Review Board stuff? https://docs.gethue.com/developer/development/#setup [14:16:01] That's not exactly a low barrier of entry. [14:16:17] But it sounds like you might get away with a GH PR if the patch is simple enough [14:16:59] milimetric: nono just fork cloudera/hue on gh and send the pr [14:17:12] I already sent 5 of them [14:17:30] best if you PR from your repo, mine is a little messed up [14:18:09] they've got a whole process there... it seems like we should follow it... well, anyway [14:18:25] I meant I can send you the commit and we can live test with it (or you gotta tell me how to apply it) [14:18:56] The second sentence says: "For more complex patches it's advisable to use RB than a plain pull request on github." 
which kinda sounds like simple stuff is ok as GH PR [14:19:01] ah so I manually change files on an-tool1009, I know it is not pretty but quicker for small fixes [14:22:07] ottomata: o/ - https://gerrit.wikimedia.org/r/c/operations/puppet/+/629647/10/modules/profile/manifests/hadoop/common.pp - wdyt? [14:22:51] no regex.yaml, etc.. [14:23:22] only one parameter, and we'd drop a ton of per-host configs (to change the partition list when a disk is broken for example) [14:34:37] elukey: whaajaaa????? cool wait factor partitions???? [14:34:38] looking [14:35:10] elukey: THAT IS SO COOL! [14:35:11] yeah1 [14:35:13] !!! [14:35:18] great idea [14:35:35] add some good param docs for that explaining that it is filtering by that as mount path [14:35:38] so the disk had better be mounted! [14:35:41] but that is very cool [14:35:47] super doing so [14:36:58] this is nice https://puppet-compiler.wmflabs.org/compiler1003/25393/analytics1034.eqiad.wmnet/index.html [14:37:13] on 1034 (test cluster) I have probably not have kept track of all disks broken [14:37:23] and it is telling me that 2 are not mounted [14:38:19] but j seems to be there, mmm need to double check [14:38:26] the prod workers are all good afaics [14:42:15] cool [14:43:55] ah no super weird [14:43:55] /dev/sdj1 => { [14:43:55] filesystem => "ext4", [14:43:56] mount => "/var/lib/hadoop/data/k", [14:44:06] /o\ [14:44:11] this is on 1034 [14:44:15] okok it is right [14:44:17] uff [14:44:30] the hadoop test is ... a little messed up [14:55:55] 10Analytics-Radar, 10Epic, 10MW-1.35-notes (1.35.0-wmf.32; 2020-05-12), 10Platform Team Initiatives (Revision Storage Schema Improvements), 10Technical-Debt: Remove revision_comment_temp and revision_actor_temp - https://phabricator.wikimedia.org/T215466 (10Marostegui) [14:57:58] elukey: ok, where on an-tool1009 is this thing? 
[14:58:54] all the files are under /usr/lib/hue [14:59:03] but I am not sure if you can modify those [14:59:09] if you gimme the diff I can try [14:59:12] and restart [15:00:01] (03CR) 10Mforns: [V: 03+2] "OK, I finished the suggested changes and tested this with real data." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/628933 (https://phabricator.wikimedia.org/T263495) (owner: 10Mforns) [15:00:55] ottomata: I think this should fix the mediawiki_job deletion script ^ feel free to review, sorry it took me more than I thought [15:01:37] ping milimetric ottomata mforns fdans [15:02:19] ping joal [15:02:36] mako changes should get picked up, let's see [15:02:52] I can force a restart [15:02:55] already changed? [15:03:03] ah, not git [15:03:11] no, I have a patch... I guess I can just edit [15:03:24] yeah it is a debian package deployed basically [15:04:44] 10Analytics, 10Operations: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (10JAllemandou) > Currently these reports are going to Logstash; I don't think there's any refinement possible there? Not the refinement we do usually on the cluster indeed.... [15:10:14] elukey: those are all owned by root :( can you give me write permissions? [15:12:01] milimetric: what is the diff? I can apply it on the fly [15:14:55] elukey: /home/milimetric/patch.diff on an-tool1009 [15:15:00] super [15:18:57] 10Analytics, 10Event-Platform: EventStreams error in logs: Error: Invalid number of arguments (for prometheus?) - https://phabricator.wikimedia.org/T263759 (10Ottomata) [15:19:21] 10Analytics, 10Event-Platform: EventStreams error in logs: Error: Invalid number of arguments (for prometheus?) 
- https://phabricator.wikimedia.org/T263759 (10Ottomata) [15:46:34] PROBLEM - HDFS corrupt blocks on an-master1001 is CRITICAL: 5 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen [15:47:18] 5 is not problem, will check why [15:47:23] * no problem [16:21:30] now it's 9 [16:24:17] very strange [16:26:33] so every time it happened in the past it eventually auto-resolved [16:26:54] we can try to check/list the corrupt blocks on the namenode [16:31:09] klausman: elukey razzi ops sync here? [16:31:09] https://meet.google.com/ako-cdgs-mmw [16:31:19] yep coming sorry, got some water [16:31:34] thirst is no excuse for tardiness, i'm marking this down in your records [16:34:20] (03PS8) 10Milimetric: [WIP] Add filter/split component to Wikistats TODO: I'm pushing this even though I haven't tested it in the UI at all, because it passes tests and I find that interesting, I want to see the difference between this and the eventually working version, to see if I miss adding any tests as I make it work. Basically, this just tries to remove references to "breakdown" from everywhere.
[analytics/wikistats2] [16:34:20] (https://phabricator.wikimedia.org/T249758) (owner: 10Fdans) [16:37:43] (03PS9) 10Milimetric: Add filter/split component to Wikistats [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/613114 (https://phabricator.wikimedia.org/T249758) (owner: 10Fdans) [16:43:56] Came across that the other day - thought it would be worth sharing: https://backstage.io/ [16:49:01] 10Analytics, 10Analytics-Wikistats, 10I18n, 10good first task: Add link to translatewiki.net in wikistats footer - https://phabricator.wikimedia.org/T261502 (10Nuria) We need to deploy this change, see: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Wikistats_2#Contributing_and_Deployment [16:49:44] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, 10I18n, 10good first task: Add link to translatewiki.net in wikistats footer - https://phabricator.wikimedia.org/T261502 (10Nuria) a:05paulkernfeld→03razzi [16:49:59] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, 10I18n, 10good first task: Add link to translatewiki.net in wikistats footer - https://phabricator.wikimedia.org/T261502 (10Nuria) Assigning to razzi for deploy [16:50:27] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, 10I18n, 10good first task: Add link to translatewiki.net in wikistats footer - https://phabricator.wikimedia.org/T261502 (10Nuria) cc @fdans so he knows this link was added [17:28:31] running a fsck / on hdfs to see if I get more info about those corrupt blocks [17:30:43] mforns: o/ - are the refine failures already checked/handled? [17:31:00] (asking since we have the usual icinga alert open for refine flags failures) [17:31:01] elukey: let me see [17:31:29] elukey: no, I'll do, sorry [17:32:01] mforns: nono please don't say that, I was just checking alarms :) [17:32:21] yes, don't worry!
[17:32:21] milimetric: the change is live on an-tool1009 [17:32:40] mforns: ah also when you have a moment I can explain the oozie admin test that I have in mind [17:32:51] elukey: ok [17:33:32] elukey: oh looks good: https://hue-next.wikimedia.org/hue/jobbrowser/jobs/0009826-200915132022208-oozie-oozi-W#!id=0009824-200915132022208-oozie-oozi-W [17:33:48] (you get a bar, though I think it's wrong? and you get no more error) [17:34:37] there's definitely a bunch of bugs on that page [17:34:53] there are some random ghost arrows on the left floating around and the bottom two boxes are just nonsense [17:35:12] that's more fundamental... I don't think the fix we did helps you at all, does it? [17:35:13] so I checked webrequest-load-wf-text-2020-9-21-21, that failed while doing add_partition [17:35:25] and it has a green bar at the end, but no red bar at the top [17:35:40] the error in the logs is gone though [17:35:51] hm... maybe the status isn't error anymore [17:36:22] I know while I was testing the thing it was failing on had a status of ERROR [17:36:36] and when I applied the change manually, it set a red bar, but I don't remember where [17:36:43] I see another one in the console "POST https://hue-next.wikimedia.org/metadata/api/catalog/list_tags 500" [17:36:55] but I can't find traces of that in the logs of hue (that 500 I mean) [17:38:12] yeah, it just looks like serious problems on this page [17:38:20] klausman, joal - this is the fsck result https://phabricator.wikimedia.org/P12795 - so I think that the namenode's jmx metrics have a bug [17:38:22] that 500 sounds like they're making a bad request [17:38:48] I mean... I can't even click on any of the subworkflows, they throw an error [17:39:10] ah snap I didn't see those 400s [17:39:12] I'm gonna abandon this one... 
maybe I'll just comment on the issue [17:40:19] on the logs I see [17:40:19] desktop.lib.rest.http_client.RestException: 400 Client Error: Bad Request for url: http://an-coord1001.eqiad.wmnet:11000/oozie/v1/job?doAs=elukey&timezone=UTC [17:40:27] "The request sent by the client was syntactically incorrect." (Apache Tomcat/6.0.53, error 400) [17:40:38] that is weird [17:42:42] ahhh and on the oozie side [17:42:43] java.lang.IllegalArgumentException: id cannot be empty [17:43:22] so I bet money that hue-next is tailored for oozie 5's api [17:48:53] ah yes I have a lead on the code, sigh [17:57:46] mforns: i see you are looking into (or looked into) that refine error? [17:57:50] did you find out what went wrong? [17:57:58] the error looks like maybe someone deleted the schema again [17:58:07] the CirrusSearch one? [17:58:31] no CitationUsage [17:59:06] yup https://meta.wikimedia.org/w/index.php?title=Schema:CitationUsage&action=history [17:59:09] i'll add it to exclude list [17:59:23] ottomata: ok [18:00:42] milimetric: opened https://github.com/cloudera/hue/issues/1273 [18:01:47] mforns: can you and razzi finalize the deletion stuff, maybe next week? [18:01:47] https://gerrit.wikimedia.org/r/c/operations/puppet/+/628895 [18:01:56] I don't feel comfortable merging that today [18:02:05] ottomata: sure no problem [18:02:06] and i guess your script changes need a refinery deploy anyway [18:02:08] ya? [18:02:17] yes, I was going to do that today [18:02:21] ok cool [18:02:25] but yea, for you, it's too hasty [18:02:32] thank you! [18:02:48] when will you be back? [18:02:58] 10Analytics, 10Analytics-Wikistats, 10Commons: Creating tools for compiling list of Wikimedia Commons users by contributions/upload - https://phabricator.wikimedia.org/T263377 (10SecretName101) [18:03:00] you guys can get the checksum together, verify the delete will do the right thing, maybe do the first run manually, and razzi can merge puppet [18:03:05] monday oct 12th [18:03:18] ok, yes, we can do that [18:03:51] 10Analytics, 10Analytics-Wikistats, 10Commons: Creating tools for compiling list of Wikimedia Commons users by contributions/uploads - https://phabricator.wikimedia.org/T263377 (10SecretName101) [18:04:45] Cool. mforns: want to set aside time on Monday to do that?
ApacheCon will be going on most of the rest of next week [18:05:49] 10Analytics-Clusters, 10Analytics-Kanban: Review and improve Oozie authorization permissions - https://phabricator.wikimedia.org/T262660 (10elukey) Testing idea: on analytics1030 (test cluster, where oozie runs) we have: ` elukey@analytics1030:~$ sudo cat /etc/oozie/conf/adminusers.txt # Admin Users, one user... [18:06:02] mforns, razzi - https://phabricator.wikimedia.org/T262660#6492160 [18:06:12] lemme know if it makes sense [18:06:25] lookin [18:07:07] elukey: yes, that makes total sense [18:07:53] should be very easy to test [18:07:57] we'll have to prepare the terrain a bit for the webrequest job to write its output to a separate place owned by analytics-privatedata [18:08:13] the job writes to 3 different tables [18:08:23] but yes, we'll manage [18:08:24] could also be any other job, not webrequest [18:08:26] even a simpler one [18:08:35] there's not a lot of data there [18:08:53] IIRC joseph a while ago launched the pageview one [18:09:20] but feel free to also kill/reuse the running webrequest job [18:09:28] I run it via a script in my home [18:09:53] mforns: /home/elukey/launch_webrequest_bundle.sh on an-tool1006 [18:09:56] elukey: we already ran it yesterday [18:10:25] yeah my point is that you don't need to recreate different tables etc.. if you don't really want [18:10:35] you can use the existing ones, like kill the current coord etc.. [18:10:37] but to run it under analytics-privatedata and not fail when writing data, we have to create a couple tables and set up a couple directories, not a big deal, but still todo [18:10:38] feel free to do it [18:10:50] mforns: why privatedata? You can use analytics no? [18:11:07] I mean, exactly what it is running now [18:11:36] oh I see...
[18:11:52] I mean, I don't really care about those :D [18:12:33] hehe, I understand now, maaaan, I'm really bad in ops [18:12:55] you can kill it / restart it anytime, eventually I'll restart it just to have some continuity and catch bugs, but data holes are really not important in there [18:13:18] ahahahah nono it is something that only I use, so it would require some "standardization" [18:13:21] ok, easier then! we can do this today if razzi agrees [18:13:22] my bad on this side [18:13:45] then if it works it is a matter of following up with others [18:14:00] yea, makes sense! [18:14:02] and eventually deploy the change in prod [18:14:06] super :) [18:14:10] :] [18:14:38] all right thanks! Logging off for today [18:14:40] o/ [18:14:58] oh razzi btw I missed your last message: yes we can work on the deletion script on monday [18:18:23] nuria: i made some suggested edits to your blog posts, looks great! [18:18:31] feel free to ignore any or all of them :) [18:20:34] 10Analytics, 10Platform Engineering, 10Epic, 10Platform Team Initiatives (API Gateway): AQS 2.0 - https://phabricator.wikimedia.org/T263489 (10Pchelolo) [18:34:55] mforns: I'm down to look at the oozie stuff today, let me know when works for you [18:35:30] hey razzi :] yes, now works for me, or else in 90 mins, what do you prefer? [18:35:47] mforns: Let's go for it now [18:35:51] k! omw [18:36:25] :) <3 [18:58:33] ottomata: super super thanks [19:00:40] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10Nuria) assigning to @mforns [19:29:54] 10Analytics-Clusters, 10Analytics-Kanban: Review and improve Oozie authorization permissions - https://phabricator.wikimedia.org/T262660 (10razzi) Confirmed with @mforns that adding to the bundle.properties `oozie.job.acl = ` (in this case wikidev) Allows administering jobs via th...
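(Editor's note: the ACL test razzi confirms above boils down to one line in the job's properties file. A hedged sketch follows — the `wikidev` value is the example named in the Phabricator comment; the exact value pasted there was elided, so treat this as illustrative rather than the production config.)

```properties
# bundle.properties — grant a group admin rights on this Oozie job.
# Members of the group can kill/suspend/resume the job even if they
# did not submit it; membership is resolved via Oozie's group mapping.
oozie.job.acl=wikidev
```

As elukey asks further down, the same mechanism should in principle accept a POSIX group like `analytics-privatedata-users`, though the chat records that this did not work on the first try.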
[19:57:18] 10Analytics-Clusters, 10Operations, 10decommission-hardware: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (10RobH) I'm removing the #ops-eqiad tag, as this is hurting their open task metrics when it's never actually been within their ability to move this forward. When thi... [20:05:51] 10Analytics, 10Event-Platform: Figure out where stream/schema annotations belong (for sanitization and other use cases) - https://phabricator.wikimedia.org/T263672 (10mforns) I think we've discussed this before, but just for the record: I think one important aspect of the sanitization config is that changes to... [20:32:34] 10Analytics-Clusters, 10Analytics-Kanban: Review and improve Oozie authorization permissions - https://phabricator.wikimedia.org/T262660 (10elukey) @razzi have you tried to add `analytics-privatedata-users` as `oozie.job.acl` and see if it works? In theory it should, IIUC the group membership is checked using... [20:39:27] 10Analytics-Clusters, 10Analytics-Kanban: Review and improve Oozie authorization permissions - https://phabricator.wikimedia.org/T262660 (10razzi) @elukey we did try that, and it didn't work. It's possible we misconfigured something; could give that another try. [20:39:49] 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10SDAW-MediaSearch (MediaSearch-Beta), 10Structured-Data-Backlog (Current Work): [L] Instrument MediaSearch results page - https://phabricator.wikimedia.org/T258183 (10egardner) Thanks @Ottomata for your help in getting things working today...
[20:56:05] 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10SDAW-MediaSearch (MediaSearch-Beta), 10Structured-Data-Backlog (Current Work): [L] Instrument MediaSearch results page - https://phabricator.wikimedia.org/T258183 (10egardner) Also, since SearchSatisfaction is still a legacy schema, it ma... [20:59:49] 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10SDAW-MediaSearch (MediaSearch-Beta), 10Structured-Data-Backlog (Current Work): [L] Instrument MediaSearch results page - https://phabricator.wikimedia.org/T258183 (10Mholloway) Hey @Ottomata (and @egardner!), I'm still catching up here bu... [21:01:05] 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10SDAW-MediaSearch (MediaSearch-Beta), 10Structured-Data-Backlog (Current Work): [L] Instrument MediaSearch results page - https://phabricator.wikimedia.org/T258183 (10Mholloway) Oh, I think @jlinehan wanted to give it another look as well. [21:01:20] 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, 10Goal, and 3 others: Modern Event Platform - https://phabricator.wikimedia.org/T185233 (10srodlund) [21:01:30] 10Analytics, 10Event-Platform, 10Technical-blog-posts: Story idea for Blog: Wikimedia's Event Platform - https://phabricator.wikimedia.org/T253649 (10srodlund) 05Open→03Resolved This is published! Note I went with an image of a stile from Commons as opposed to Flickr. https://techblog.wikimedia.org/2020... [21:01:56] 10Analytics, 10Analytics-Kanban, 10User-jijiki: Mechanism to flag webrequests as "debug" - https://phabricator.wikimedia.org/T263683 (10jijiki) @Milimetric that is fine, take your time and thank you!
[21:11:13] 10Analytics, 10Platform Engineering, 10Epic, 10Platform Engineering Roadmap, 10Platform Team Initiatives (API Gateway): AQS 2.0 - https://phabricator.wikimedia.org/T263489 (10Pchelolo) [21:21:31] 10Analytics, 10Product-Analytics, 10Structured Data Engineering, 10SDAW-MediaSearch (MediaSearch-Beta), 10Structured-Data-Backlog (Current Work): [L] Instrument MediaSearch results page - https://phabricator.wikimedia.org/T258183 (10Ottomata) So, I'm disappointed that we need EventStreamConfig set up for... [22:04:35] 10Analytics, 10Operations: Augment NEL reports with GeoIP country code and network AS number - https://phabricator.wikimedia.org/T263496 (10Nuria) @JAllemandou I think adding geo info (or rather swapping IP by Geo info ) is something that would need to happen in this case (in the absence of stream processing b... [22:09:12] (03CR) 10Nuria: [C: 03+2] Set timeout in oozie dumps-dependent jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/629073 (https://phabricator.wikimedia.org/T263529) (owner: 10Joal) [22:09:15] (03CR) 10Nuria: [V: 03+2 C: 03+2] Set timeout in oozie dumps-dependent jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/629073 (https://phabricator.wikimedia.org/T263529) (owner: 10Joal) [22:24:06] 10Analytics, 10Event-Platform: Figure out where stream/schema annotations belong (for sanitization and other use cases) - https://phabricator.wikimedia.org/T263672 (10Nuria) +2 to @mforns comments [23:46:02] 10Analytics-Clusters, 10Analytics-Kanban, 10User-Elukey: Create temporary cluster to hold a copy of data for backup purposes - https://phabricator.wikimedia.org/T263814 (10Nuria)