[00:10:07] fdans: I asked my family in India to check and while it's possible they missed an app, they certainly covered the most popular ones [00:10:19] I am leaning towards that the image is requested but not displayed [01:26:12] 10Analytics, 10Analytics-Kanban, 10Anti-Harassment, 10Event-Platform, and 2 others: Migrate Anti-Harassment EventLogging schemas to Event Platform - https://phabricator.wikimedia.org/T268517 (10Ottomata) Hi all, this PHP client migration blocker has been a pain, hopefully we are getting close to done. I... [01:38:18] is there a gap between say when I make a webrequest and when it shows up in a Hive query? [01:38:31] like if I were to query for Feb 9 UTC, would it show up now? [03:47:02] sukhe: just logging in to say wowowowowow super amazing job, incredible detective work [06:11:37] !log disable systemd timers on an-launcher1002 (prep step for bigtop) [06:11:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [06:11:48] * elukey coffee [06:14:44] !log disable timers on labstore nodes (prep step for bigtop) [06:14:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [06:24:11] Good morning [06:29:51] joal: morning! [06:36:20] (03PS1) 10Lex Nasser: Fix unit tests that ensure certain requests fail and clean up all unit tests [analytics/aqs] - 10https://gerrit.wikimedia.org/r/662821 (https://phabricator.wikimedia.org/T273404) [06:38:30] joal: goodnight! [06:42:42] Good night lexnasser :) [06:51:30] good morning :) [06:54:43] joal: very interesting readin https://github.com/prestodb/presto/issues/15685 [06:55:05] basically the issues that I was seeing with presto 246 were all related to a typo in the worker hive config file (my bad) [06:55:20] but the division by zero stacktrace was basically (no nodes available) [06:55:49] (in our case - no nodes available for the given catalog) [06:56:12] not particularly happy about this, Trino might have a little bit more flexibility [06:56:21] anyway, the presto upgrade should be unblocked :) [07:09:46] !log stop namenode on an-worker1124 (backup cluster), create two new partitions for backup and namenode, restart namenode [07:09:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:12:06] !log stop airflow on an-airflow1001 (prep step for bigtop) [07:12:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:13:23] elukey: I'm going to manually kill all jobs except Flink - they are users jobs [07:14:00] joal: sure [07:19:51] !log Killing yarn users applications [07:19:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:20:03] elukey: sending a message to slack to tell we start [07:23:39] joal: let's advertise that the actual downtime will happen after the backup, just in case [07:23:49] Ack elukey - makes sense [07:24:00] better: let's say that we are draining the cluster etc.. (prep steps) [07:25:11] in the meantime I am rolling out the new apt config (plus the hive/yarn settings that the cluster needs) [07:25:16] ack elukey [07:25:23] hello teammmmm :] [07:25:39] Good morning mforns :) [07:25:43] hola mforns! [07:25:47] morning [07:26:54] ahahha did you see https://phabricator.wikimedia.org/T273741 ? 
We hit hacker news :D [07:27:42] :) [07:28:06] !log roll out new apt bigtop changes across all hadoop-related nodes [07:28:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:30:46] elukey: distcp is happier with the healthier backup namenode :) [07:34:43] subtitle: distcp complained about Luca not doing his job correctly and it was right :D [07:38:11] not exactly :) [07:38:30] elukey: last run of distcps has started, so far so good [07:39:31] elukey: there is a script somewhere restarting a spark job for analytics-search [07:39:40] elukey: ores_bulk_ingest.py [07:39:54] I kill the app, it's back in no time [07:39:59] ahahahh [07:40:05] :( [07:40:12] so it cannot be on an-launcher1002 [07:40:14] dcausse: if by any chance you're nearby [07:40:23] lemme check puppet [07:40:27] sure elukey [07:41:23] not in there [07:42:18] maybe it is a stat100x timer [07:42:28] elukey: could be a cron :( [07:44:56] don't find it on stat mm [07:45:14] could be from airflow machine? [07:46:37] elukey: launched by Erik - I'm assuming the cron/timer is on one of discovery machine [07:48:19] joal: in theory no, they don't have access to our network [07:48:24] hm [07:49:14] spark.yarn.dist.archivesfile:///home/ebernhardson/ores_venv.zip#venv [07:49:27] not great paste buuut it is Erik for sure :D [07:49:41] :) [07:49:44] srv/deployment/wikimedia/discovery/analytics/spark/wmf_spark.py [07:50:05] lemme check Erik's crons [07:50:07] ack [07:51:05] nothing on the stat boxe [07:51:08] *boxes [07:51:09] :( [07:51:14] an-launcher? [07:52:11] he doesn't have access to it [07:52:40] elukey: it's Erik we're talking about :) [07:53:22] I know I know :) [07:54:32] nothing on airflow1001 that I can see [07:55:18] elukey: I think the thing could be a process launched in backgrounf [07:55:23] instead of a timer [07:55:44] a process polling yarn every minute of so, checking that the app is running and relaunching if needed [07:56:25] could be a good bet.. so I found /home/ebernhardson/ores_venv.zip on an-airflow1001 [07:56:32] lemme check again [07:56:32] hm [07:57:08] bingo :) [07:57:39] it is a tmux yes [07:57:47] \o/ [07:58:14] that I have to kill now though.. 
hopefully it is fine [07:58:24] elukey: +1 [07:58:47] joal: done [07:59:14] elukey: yarn app killed [07:59:18] checking [07:59:46] all good [08:00:01] elukey: We can continue to tear-down the tools :) [08:00:09] Thanks for finding the script :) [08:01:59] :) [08:02:04] so the new apt config is deployed [08:03:04] !log stop presto-server on an-presto100x and an-coord1001 - prep step for bigtop upgrade [08:03:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:04:54] done [08:05:11] !log stop oozie an-coord1001 - prep step for bigtop upgrade [08:05:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:05:17] joal: --^ [08:05:24] ack elukey [08:05:41] done, lemme know when ok to go for hive [08:05:51] elukey: I'm monitoring distcp jobs - still quite some to go, but it's moving [08:05:55] please go elukey [08:06:42] David is killing flink [08:06:55] awesome - thanks dcausse [08:07:00] !log stop hive on an-coord100[1,2] - prep step for bigtop upgrade [08:07:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:08:01] all right done [08:08:17] !log stop jupyterhub on stat100x [08:08:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:15:12] joal: if you are using the esam bastion change it, Arzhel is doing maintenance :D [08:15:21] ack elukey [08:15:26] flink job is killed but there's still the session cluster (application_1609949395033_179785) running, can't login to stat machines I can't kill it [08:15:37] ack dcausse - doing it now - thanks mate [08:16:06] !log Kill flink yarn app [08:16:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:16:09] done [08:16:23] thanks! [08:19:19] !log umount /mnt/hdfs from all nodes using it [08:19:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:19:54] ok joal we are ready [08:20:08] elukey: wait wait wait, i'm not !!!! :) [08:20:13] I can also set safemode if you want [08:20:19] ahahah nono I mean for the backup [08:20:25] I know - joking :) [08:21:24] last backup run has started, 'big' stuff is mostly done, 'numerous' stuff not yet [08:21:35] one thing that we forgot is the million cron jobs scheduled on the stat boxes [08:21:43] but we are not going to stop all of them : [08:21:44] :D [08:21:51] nope :) [08:22:08] empty cluster - \o/ [08:24:00] elukey: let's activate safemode if easy [08:25:27] !log set hdfs safemode on for the Analytics cluster [08:25:29] joal: done [08:25:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:26:01] ok so now we wait one/two hours [08:26:09] we are on track timing wise [08:26:23] elukey: I hope I can make it faster, but really not sure [08:27:40] elukey: you can go for coffee, I'm on backup copy :) [08:28:43] elukey: actually no [08:28:57] elukey: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot issue delegation token. Name node is in safe mode. [08:29:04] I had not forsee that!!! 
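For reference, the safe-mode toggling logged just above uses the stock hdfs dfsadmin subcommands, run as the HDFS superuser on this kerberized cluster (auth details omitted); the distcp failure happens because launching a new job needs a fresh delegation token, which the namenode refuses to issue while in safe mode. A minimal sketch:

    sudo -u hdfs hdfs dfsadmin -safemode enter   # metadata read-only; no new delegation tokens issued
    sudo -u hdfs hdfs dfsadmin -safemode get     # check the current state
    sudo -u hdfs hdfs dfsadmin -safemode leave   # back to normal so distcp jobs can still be launched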
[08:29:06] looool [08:29:19] elukey: already running jobs are ok, but new ones can't be launched [08:29:23] MEH [08:29:44] !log leave hdfs safemode to let distcp do its job [08:29:44] meaning: you can't set hdfs in read-only mode and use it [08:29:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:29:47] how bad [08:30:04] fixed elukey - job started [08:30:10] it is also called "safe mode" so maybe it is considered more than read only [08:30:12] thanks [08:30:17] yeah you're right [08:30:19] but interesting indeed [08:30:23] I wouldn't have thought about it [08:30:26] anyway, all good :) [08:30:30] :) [08:30:38] I am going to stop replication on mysqls in a bit [08:30:57] sure [09:44:20] elukey: last 2 jobs running [09:45:03] * elukey dances [09:45:19] elukey: unfortunately it still represents some time :( [10:04:05] backup of hive/oozie dbs done [10:04:25] !log stop mysql replication an-coord1001 -> an-coord1002, an-coord1001 -> db1108 [10:04:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:20:24] joal: going to step away for ~15 mins, I did everything before the stop cluster step [10:20:31] once distcp finishes we can start [10:21:19] exactly elukey - I monitor, but it's long :( [10:54:27] 10Analytics: Upgrade the Analytics Hadoop cluster to Apache Bigtop - https://phabricator.wikimedia.org/T273711 (10elukey) [10:54:43] extended the maintenance to 14PM [10:55:34] I am going to quickly eat something before starting [10:56:58] ack elukey - the last distcp jobs are in the long phase of commiting (deletion of old phase) [10:57:20] elukey: this part is not parallelized, and going over millions of files take time :( [10:58:00] elukey: If you prefer, I can also stop the jobs, as I am sure that all the needed data is copied with the correct perms - We might just have a little bit more than needed :) [11:02:43] I target the job being fully done in less than 1/2h - I'm sorry for the extended wait :( [11:08:15] nono let's do things properly :) [11:08:31] half an hour is good to avoid headaches later on [11:08:43] also I hope that the cookbooks will do their job in a reasonable amount of time [11:08:58] ack elukey [11:09:06] only one letf [11:31:36] elukey: We're ready :) [11:32:25] wow [11:32:32] all right, stopping the cluster joseph [11:32:46] BANZAI [11:32:52] sending a message on slack [11:35:59] crossing fingers [11:36:11] hdfs save namespace is taking a bit but it is expected with such a cluster [11:36:49] * mforns back in 40 mins [11:37:27] elukey: imagine a thousand-nodes cluster :) [11:37:59] I cannot :D [11:38:09] hdfs namenode federation etc.. [11:38:12] what a mess [11:38:24] in such case you hire hadoop committers :D [11:38:28] yep [11:38:59] I have become a distcp aficinados - I can tell where to improve :) [11:40:05] it is now stopping daemons [11:43:22] hdfs datanodes are stopped gently so 60 of them take time :) [11:44:32] then master nodes and finally journals [11:44:49] once done we'll start the upgrade [11:47:20] it should take ~10/15 mins more [12:00:32] joal: we are at the journalnodes, after them (5) we should be all down [12:03:13] ack elukey - ready for anything I can help with :) [12:04:52] joal: ok kicking off the upgrade :) [12:05:05] Ack! 
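The backup copy discussed above is plain DistCp between the production and backup clusters; the slow final phase joal mentions is the non-parallelized commit/cleanup step at the end of the job. A rough sketch of such an incremental run (the nameservice names and the path are placeholders; the flags are stock DistCp options for copying only changed files while preserving ownership and permissions):

    # incremental copy, pruning target files that no longer exist on the source
    hadoop distcp -update -delete -pugp \
      hdfs://analytics-hadoop/wmf/data/SOME_DIR \
      hdfs://analytics-backup-hadoop/wmf/data/SOME_DIR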
[12:10:04] it is upgrading the worker nodes [12:13:25] 10Analytics: Upgrade the Analytics Hadoop cluster to Apache Bigtop - https://phabricator.wikimedia.org/T273711 (10elukey) [12:14:03] this step should take 15ms more in theory [12:14:08] 15mins [12:14:20] ms would have been a too high bar to reach :D [12:14:25] :) [12:21:08] good morning team, reading backscroll [12:21:16] good morning fdans :) [12:22:56] hellooo fdans :]\ [12:26:21] good morning [12:26:43] fdans: so we are upgrading the packages basically, ETA 20 mins more or less for master+workers [12:26:55] then we'll go through clients one-by-one testing [12:27:13] all steps in https://etherpad.wikimedia.org/p/analytics-bigtop-upgrade [12:28:08] the / partition is full on an-master1001, known? [12:28:30] wow - thanks moritzm [12:28:43] elukey: --^ [12:29:19] ouch [12:29:51] +Well, 9G of logs in /var/log/hadoop-hdfs [12:30:27] I also recommend apt clean to clean out /var/cache/apt/archives (another 2.5G) [12:30:57] (And to find these things `sudo ncdu -x /`) [12:31:55] I dropped the namenode saved for the kerberos migration (LOL) and now we are down to 85% usage [12:31:59] moritzm: thanksssss [12:32:26] klausman: +1 on apt clean, doing it [12:32:48] we are at 79% all good [12:32:52] for the logs we have a task about it [12:32:57] Also, the hadoop logs are not compressed. Are they rotated by hadoop itself or by logrotate? [12:33:20] by hadoop itself, I hope with the new version to have the log4j version to be able to do it [12:33:26] Ack. [12:33:41] also if you check lvs there are a lot of things unbalanced [12:34:13] TIL ncdu [12:34:19] Ah, today's biznis has created a shit-ton of audit logs [12:34:29] yep yep [12:34:34] Basically, hdfs-audit.log.* is from today [12:35:01] in this case I think that distcp is the one to blame [12:35:13] since it just tried to scan 50M+ files for changes :D [12:35:22] Soory :S :) [12:35:41] it must have been ongoing for some time (a few days at least) [12:35:51] -rw-r--r-- 1 hdfs hdfs 268435648 Feb 9 09:59 hdfs-audit.log.20 [12:35:53] Not really. [12:35:59] -rw-r--r-- 1 hdfs hdfs 268435570 Feb 9 10:29 hdfs-audit.log.1 [12:36:23] half an hour of doing things -> 5.1G of logs. [12:36:29] this was from this morning probably, I suspect all getfileinfo RPC calls [12:36:43] yes yes it would make sense [12:37:08] klausman: you can drop 3/4 of the last audit ones if you have time [12:37:09] Yep, about half a million of those per logfile [12:37:24] We have twenty files, so drop the fifteen oldest? [12:38:15] let's say 10 [12:38:20] ok [12:38:26] thanks :) [12:38:46] And done. 12G free/73% used on / now [12:39:54] <3 [12:39:56] I presume the two backup tars in /root are temporary and will be removed once we're done with today's migration? [12:40:02] I'm surprise we have no older audit files - I run regular distcp jobs since a few days [12:40:04] weird [12:40:29] Well, I dunno how many files log4j is keeping around. It might be that the traffic created today expired everything older [12:40:45] The number 20 sounds like a limit of that kind. [12:41:06] agreed [12:43:11] ok so the hdfs daemons are ok after the install, the yarn nodemanager are a little angry that the RM is not up, but all good [12:43:49] now I am going to re-enable slooowly on worker nodes [12:44:05] * klausman puts fingers in ears and waits for explosion [12:44:59] thanks for the trust :D [12:45:59] changes are ok in the canary, proceeding with all [12:52:25] joal: in the meantime, can you folks coordinate on what to test etc..? 
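The an-master1001 cleanup above comes down to three shell steps; the audit-log pruning keeps the ten newest rotated files, matching the "let's say 10" agreement:

    sudo ncdu -x /        # interactively find what fills the root partition
    sudo apt clean        # drop cached .deb files from /var/cache/apt/archives
    # remove all but the 10 newest rotated HDFS audit logs
    cd /var/log/hadoop-hdfs
    ls -t hdfs-audit.log.* | tail -n +11 | xargs -r sudo rm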
[12:52:32] sure [12:52:39] <3 [12:52:53] I'm available to test anything [12:53:15] so the first step will be hdfs only, then we'll upgrade stat100x and hive etc.. [12:53:20] about tests: ack [12:55:49] namenode 1001 loading inodes [12:55:55] fdans: for hdfs, lets both do stuff (read, touch files, try other users) [12:56:22] joal: sounds good [12:56:28] fdans: then when ready there will be: Hive, beeline, spark, oozie + hue (AQS rerun) [12:56:40] klausman: can you please keep an eye on an-master100[1,2] disk usage if you have time? [12:57:18] the only thing we don't test manually is camus [12:57:26] Will do [12:57:34] elukey: do you wish we do that (test camus manually) ? [12:58:02] joal: we can restart the timer, but later on :) [12:58:22] sure elukey - I can also devise a manual run of something else [12:58:28] just to test the mechanics [12:59:47] joal: which host should I be testing hdfs stuff? [13:00:43] fdans: let's here from elukey about the ones updated [13:01:13] joal: sorry, got ahead of myself :) I'm writing down a list of commands to test [13:01:26] no prob fdans [13:01:27] :) [13:08:17] the namenode is still not out of safemode [13:10:04] me too will test! [13:10:59] Yay - MOAR TESTERS :) [13:11:59] so the namenode sees only 26 workers up, instead of 59 [13:12:02] going to check why [13:15:14] disk space is still unchanged, btw [13:16:17] so some datanodes try to contact 1002 for some reason, going to upgrade the standby as well, since 1001 is active [13:17:27] 10Analytics, 10Event-Platform: Sanitize and ingest event tables defined in the event_sanitized database - https://phabricator.wikimedia.org/T273789 (10mforns) @Ottomata Is a sanitization refine job needed? I thought the current sanitization job was sanitizing everything that is inside the event database. My id... [13:20:17] and 1002 is fetching the fsimage from 1001, so it will take a bit [13:24:52] it is taking a long time to move few GBs, but I checked via tcpdump and things are moving [13:25:23] ok cookbook finished [13:25:31] now let's see how 1002 behaves [13:26:07] it seems as if half of the datanodes need to try to contact the standby [13:26:11] and half the active [13:26:29] elukey: round robin? [13:30:52] no idea... 
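For the two things being puzzled over here, whether the namenode has left safe mode and how many datanodes have actually reported in, the stock client commands are enough; the HA service ids in the last two lines are placeholders for whatever dfs.ha.namenodes.* ids this cluster configures:

    sudo -u hdfs hdfs dfsadmin -safemode get                        # still in safe mode?
    sudo -u hdfs hdfs dfsadmin -report | grep -i 'live datanodes'   # how many workers have checked in
    sudo -u hdfs hdfs haadmin -getServiceState NN_ID_1              # active or standby?
    sudo -u hdfs hdfs haadmin -getServiceState NN_ID_2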
[13:30:59] so the standby is up and running [13:31:42] I am restarting one datanode that doesn't show up to see how it goes [13:32:06] never happened in testing [13:32:59] very slow to start [13:33:11] PROBLEM - Hadoop DataNode on analytics1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [13:33:39] PROBLEM - Hadoop DataNode on analytics1065 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [13:36:15] RECOVERY - Hadoop DataNode on analytics1065 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [13:36:16] for some reason some datanodes are refusing to read the datanode dirs [13:36:23] :( [13:38:27] RECOVERY - Hadoop DataNode on analytics1069 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [13:38:41] !log restart hdfs-datanode on an-worker1080 (test canary - not showing up in block report) [13:38:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:39:01] !log restart hdfs-datanode on analytics10[65,69] - failed to bootstrap due to issues reading datanode dirs [13:39:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:39:56] ok I see some progress [13:41:55] elukey: do you wish to brainstorm? [13:42:47] elukey: I have access to the yarn UI - it says 56 workers [13:43:45] s/56/59 [13:43:48] joal: yeah but those are yarn node managers, datanodes are in trouble [13:43:57] I read your note yes [13:44:06] do you wish to explain in the cave? [13:44:09] elukey: --^ [13:45:07] joal: gimme a second to check [13:45:18] of course elukey [13:50:11] elukey: I don't know how ou do but the number of nodes seen by the active namenode grows every now and then [13:50:29] joal: we can bc [13:50:36] sure elukey - I'm there [13:54:11] fdans: can you do PR and advertise that it is taking more than expected? [13:54:35] elukey: understood [13:54:51] no ETA to be given yet right? [13:56:14] fdans: I'd say a couple of hours, we are having trouble in bootstrapping hdfs sadly [13:56:19] for some unknown issue [13:58:02] elukey: [13:58:14] (srry, pressed enter accidentally) [13:58:57] elukey, all good, let's all take a deep breath and carry on, I'm informing :) [14:01:57] o/ just getting on checking email! 
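The per-node restarts that follow bump the datanode JVM heap so the post-upgrade scan of the datanode dirs can complete; a rough sketch of that kind of one-off restart, assuming the stock hadoop-env.sh variable and a systemd unit named hadoop-hdfs-datanode (both assumptions for these hosts), with puppet paused so the manual change is not reverted mid-bootstrap:

    sudo puppet agent --disable "bigtop upgrade: temporary 16g datanode heap"
    # in /etc/hadoop/conf/hadoop-env.sh (stock variable name, assumed here):
    #   export HADOOP_DATANODE_OPTS="-Xmx16g $HADOOP_DATANODE_OPTS"
    sudo systemctl restart hadoop-hdfs-datanode
    sudo journalctl -u hadoop-hdfs-datanode -f    # watch the datanode dir scan progress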
[14:03:51] PROBLEM - Hive Metastore on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [14:04:19] PROBLEM - Presto Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [14:05:41] downtimed --^ [14:08:24] !log restart analytics1069's datanode with bigger heap size [14:08:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:10:03] ottomata: we are in bc [14:10:07] trobles sadly :( [14:10:34] 10Analytics, 10Event-Platform: Sanitize and ingest event tables defined in the event_sanitized database - https://phabricator.wikimedia.org/T273789 (10Ottomata) Oh, my understanding was that tables that only tables that didn't have underscores in them were dropped from event. Hm, I suppose we should add entri... [14:12:08] elukey: coming in a just a few [14:29:29] PROBLEM - Hadoop DataNode on an-worker1102 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [14:29:31] PROBLEM - Hadoop DataNode on an-worker1114 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [14:31:10] downtiming [14:37:27] RECOVERY - Hadoop DataNode on an-worker1114 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [14:40:07] RECOVERY - Hadoop DataNode on an-worker1102 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [14:50:41] !log restart analytics1069 with 16g heap size to allow bootstrap [14:50:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:50:49] !log restart analytics1072 with 16g heap size to allow bootstrap [14:50:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:54:49] !log restart an-worker1090 with 16g heap size to allow bootstrap [14:54:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:57:15] !log restart an-worker1102 with 16g heap size to allow bootstrap [14:57:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:01:32] !log restart an-worker1103 with 16g heap size to allow bootstrap [15:01:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:01:35] !log restart an-worker1104 with 16g heap size to allow bootstrap [15:01:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:18:50] 10Analytics: Upgrade the Analytics Hadoop cluster to Apache Bigtop - https://phabricator.wikimedia.org/T273711 (10elukey) [15:32:27] 10Analytics-Clusters: Balance Kafka topic partitions on Kafka Jumbo to take advantage of the new brokers - https://phabricator.wikimedia.org/T255973 (10razzi) [15:38:49] !log restart namenode on an-master1002 [15:38:59] Logged the message at 
https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:55:25] !log restart datenode on an-worker1115 [15:55:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:58:44] !log restart datanode on analytics1058 [15:58:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:08:06] !log restart datanode on an-worker1080 withh 16g heap [16:08:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:14:26] !log restart datanode on analytics1059 with 16g heap [16:14:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:30:58] !log restart datanode on ana-worker1100 [16:31:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:39:46] (03CR) 10Milimetric: "looks great, small note about the console.log" (031 comment) [analytics/aqs] - 10https://gerrit.wikimedia.org/r/662821 (https://phabricator.wikimedia.org/T273404) (owner: 10Lex Nasser) [16:44:32] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review: Uncaught TypeError: navigator.sendBeacon is not a function - https://phabricator.wikimedia.org/T273374 (10Milimetric) a:05Milimetric→03Ottomata [16:59:32] 10Analytics, 10Patch-For-Review: Upgrade the Analytics Hadoop cluster to Apache Bigtop - https://phabricator.wikimedia.org/T273711 (10elukey) [17:45:58] (03PS1) 10Nettrom: Remove the Growth schemas from the whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/663020 (https://phabricator.wikimedia.org/T273826) [17:52:58] 10Analytics-Radar, 10Product-Analytics, 10Growth-Team (Current Sprint), 10Patch-For-Review: Growth: remove Homepage and Help Panel schemas from the schema whitelist - https://phabricator.wikimedia.org/T273826 (10nettrom_WMF) [17:53:58] 10Analytics, 10Growth-Scaling, 10Growth-Team, 10Product-Analytics: Growth: delete data older than 90 days - https://phabricator.wikimedia.org/T273821 (10mpopov) @fdans: Maybe? I'm not sure what modifications @mforns did to the standard sanitization process to enable 270 days of retention for @nettrom_WMF's... [18:10:25] 10Analytics-Radar, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10Product-Analytics (Kanban): Growth: remove Homepage and Help Panel schemas from the schema whitelist - https://phabricator.wikimedia.org/T273826 (10LGoto) [18:11:04] 10Analytics-Radar, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10Product-Analytics (Kanban): Growth: remove Homepage and Help Panel schemas from the schema whitelist - https://phabricator.wikimedia.org/T273826 (10LGoto) p:05Triage→03Medium [18:27:01] hello a-team, I'm not able to connect to jupyterhub. I'm trying with "ssh -N stat1004.eqiad.wmnet -L 8880:127.0.0.1:8880" as usual, but this time it fails with error "channel 2: open failed: connect failed: Connection refused". any ideas? 
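The "connection refused" on the forwarded port is expected at this point: JupyterHub was stopped on the stat hosts earlier as part of the prep ("stop jupyterhub on stat100x"), so nothing is listening on 8880 for the tunnel to reach. A quick client-side check (the systemd unit name is an assumption):

    # the tunnel from the question
    ssh -N stat1004.eqiad.wmnet -L 8880:127.0.0.1:8880
    # in another terminal: is anything listening on the remote end?
    ssh stat1004.eqiad.wmnet 'ss -tln | grep 8880; systemctl status jupyterhub --no-pager'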
[18:27:37] daniram: there's a big upgrade in progress, the cluster is going through a lot at the moment :) [18:28:09] (was announced on the mailing lists and various slack channels, status is that it's in progress and will be for a few more hours) [18:31:39] I didn't know:)  thanks [18:34:12] PROBLEM - Oozie Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.catalina.startup.Bootstrap https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Oozie [18:35:10] daniram: you should sign up for this mailing lislt [18:35:10] https://lists.wikimedia.org/mailman/listinfo/analytics-announce [18:35:11] :) [18:37:03] ottomata: None of us has an anaconda notebook at hand to test (not yet but soon) - would you have one? [18:37:24] RECOVERY - Oozie Server on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.catalina.startup.Bootstrap https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Oozie [18:37:24] RECOVERY - Hive Metastore on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [18:38:24] RECOVERY - Presto Server on an-coord1001 is OK: PROCS OK: 1 process with command name java, args com.facebook.presto.server.PrestoServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Presto/Administration%23Presto_server_down [18:39:10] thanks ottomata :) [19:06:16] 10Analytics, 10Growth-Scaling, 10Growth-Team, 10Product-Analytics: Growth: delete data older than 90 days - https://phabricator.wikimedia.org/T273821 (10nettrom_WMF) @fdans : The parent task asks to revert the changes made in T237124. There's also the second child task, T273826, for updating the whitelist.... [19:13:38] 10Analytics, 10Growth-Scaling, 10Growth-Team, 10Product-Analytics: Growth: remove deletion timers for Growth's sanitized EL tables - https://phabricator.wikimedia.org/T274297 (10nettrom_WMF) [19:18:42] 10Analytics, 10Growth-Scaling, 10Growth-Team, 10Product-Analytics: Growth: delete data older than 90 days - https://phabricator.wikimedia.org/T273821 (10nettrom_WMF) [19:20:44] 10Analytics, 10Growth-Scaling, 10Growth-Team, 10Product-Analytics: Growth: delete data older than 90 days - https://phabricator.wikimedia.org/T273821 (10nettrom_WMF) I updated the task description to make the ask clear. I've also updated the description of the parent task to make it clearer what we're look... [19:21:49] 10Analytics, 10Growth-Scaling, 10Growth-Team, 10Product-Analytics: Growth: End wider data purge window - https://phabricator.wikimedia.org/T273815 (10nettrom_WMF) [19:35:40] joal i have anaconda envs/notebooks on stat1008 I can test on... 
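With the Hive metastore, Oozie and Presto checks recovering above, a short smoke test from a stat host is enough to confirm the client side again; a sketch, assuming a valid Kerberos ticket and the usual data-lake paths (the exact path and table are illustrative):

    kinit                                   # this cluster is kerberized
    hdfs dfs -ls /wmf/data | head           # HDFS reads through the upgraded namenodes
    hive -e 'SHOW DATABASES;'               # metastore is answering
    hive -e 'SELECT COUNT(1) FROM wmf.webrequest WHERE year=2021 AND month=2 AND day=9 AND hour=0;'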
[19:35:59] Hi fkaelin - Thanks for the offer, andrew tested :) [20:05:33] PROBLEM - Hive Metastore on an-coord1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [20:06:27] !enabling puppet and runnin puppet on an-launcher1002, restarting all timers there [20:22:50] RECOVERY - Hive Metastore on an-coord1002 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [20:31:25] !log Rerun webrequest-load-coord-[text|upload] for 2021-02-09T06:00 after data was imported to camus [20:31:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:34:34] 10Analytics-Radar, 10Better Use Of Data, 10MediaWiki-API, 10Platform Engineering, and 2 others: Load API request count and latency data from Hadoop to a dashboard - https://phabricator.wikimedia.org/T108414 (10sdkim) >>! In T108414#6812482, @mpopov wrote: > Also, with the addition of SQL Lab & Presto to Su... [20:35:33] 10Analytics: Make camus (or gobblin) jobs run in `essential` or `production` queue - https://phabricator.wikimedia.org/T274298 (10JAllemandou) [20:40:11] joal [20:40:16] https://www.irccloud.com/pastebin/KFLW8XKx/ [20:40:16] Yes :) [20:40:19] thanks :0 [20:50:37] !log rebalance kafka partitions for codfw.resource-purge [20:50:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:51:30] !log Rerun webrequest-load-coord-[text|upload] for 2021-02-09T07:00 after data was imported to camus [20:51:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [21:18:24] fdans: checked that there's no other 'rows' literal in any hive query under refinery/oozie [21:19:20] 10Analytics-Radar, 10Better Use Of Data, 10MediaWiki-API, 10Platform Engineering, and 2 others: Load API request count and latency data from Hadoop to a dashboard - https://phabricator.wikimedia.org/T108414 (10bd808) >>! In T108414#6816549, @sdkim wrote: > Given there is no data being collected and this ta... 
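The check fdans describes above (no other 'rows' literal in any Hive query under refinery/oozie) is a whole-word grep over the HQL sources; one way to run it from a refinery checkout (flags are just one reasonable choice):

    grep -rnw --include='*.hql' -e 'rows' oozie/
    # case-insensitive variant, in case a query spells it Rows/ROWS
    grep -rnwi --include='*.hql' -e 'rows' oozie/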
[21:28:41] PROBLEM - Check the last execution of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:29:07] PROBLEM - Check the last execution of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:30:27] PROBLEM - Check the last execution of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [21:31:29] PROBLEM - Check the last execution of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:00:17] RECOVERY - Check the last execution of eventlogging_to_druid_netflow_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:02:23] RECOVERY - Check the last execution of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:04:05] !log rebalance kafka partitions for eqiad.resource-purge [22:04:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [22:11:59] RECOVERY - Check the last execution of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:14:35] RECOVERY - Check the last execution of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:23:14] PROBLEM - Check the last execution of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:23:50] PROBLEM - Check the last execution of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:25:58] PROBLEM - Check the last execution of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:29:14] PROBLEM - Check the last execution of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:31:35] 10Analytics-Radar, 10MediaWiki-API, 10Patch-For-Review, 10Platform 
Team Initiatives (Modern Event Platform (TEC2)), 10User-Addshore: Run ETL for wmf_raw.ActionApi into wmf.action_* aggregate tables - https://phabricator.wikimedia.org/T137321 (10Mholloway) I could probably pick this up as a 10%-ish exerci... [22:51:21] paste for refine URI stacktrace [22:51:23] https://www.irccloud.com/pastebin/qfy1lpD8/ [23:00:29] RECOVERY - Check the last execution of eventlogging_to_druid_netflow_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:06:23] RECOVERY - Check the last execution of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:08:23] RECOVERY - Check the last execution of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:10:19] RECOVERY - Check the last execution of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:22:28] 10Analytics, 10Product-Data-Infrastructure, 10Wikimedia-Logstash, 10observability: Create a separate logstash ElasticSearch index for schemaed events - https://phabricator.wikimedia.org/T265938 (10colewhite) >>! In T265938#6812467, @Krinkle wrote: > What does "presented alongside" mean? The ability to issu... [23:28:46] PROBLEM - Check the last execution of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:31:16] PROBLEM - Check the last execution of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:33:56] PROBLEM - Check the last execution of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:44:13] 10Analytics, 10Patch-For-Review: Upgrade the Analytics Hadoop cluster to Apache Bigtop - https://phabricator.wikimedia.org/T273711 (10JAllemandou) Problems we have found a solution for (even if not great): * Hive jobs failing due to new reserved keywords in HQL * Hive jobs failing due to UDF type problem whe...
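On the "new reserved keywords in HQL" item above: newer Hive versions treat words such as ROWS (part of the windowing syntax) as reserved, so queries that used them as plain column aliases start failing after the upgrade; the usual fix is to backtick-quote the identifier or rename it. Illustration of the quoting, run through the hive CLI (table and partition values are only for the example):

    hive -e "SELECT COUNT(1) AS \`rows\` FROM wmf.webrequest WHERE year=2021 AND month=2 AND day=9 AND hour=6;"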