[00:34:40] !log sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet [00:34:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [00:35:42] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Better Use Of Data, and 4 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (10Ottomata) [01:07:47] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423 (10razzi) I was able to failover using `sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet`, everything seemed to... [01:07:52] !log sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet [01:07:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [01:08:22] Oops I meant to do that the other way this time [01:08:28] Not a problem, 1002 is still active [01:08:42] (I should be a bit more careful saying "oops" when it comes to hdfs namenode) [01:09:40] !log sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet [01:09:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [04:29:18] 10Analytics-Clusters, 10Analytics-Kanban, 10DBA, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10Marostegui) Transfer started. [06:13:21] 10Analytics, 10Analytics-Wikistats: Wikistats New Feature - https://phabricator.wikimedia.org/T283562 (10MS.NIMO) [06:21:52] Good morning [06:27:28] bonjour [06:33:26] 10Analytics-Clusters, 10Analytics-Kanban, 10DBA, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10elukey) @razzi I forgot to mention that DNS CNAME/SRV records are also to update, otherwise the various tools that we use will not work: ` templates/wmnet:s2... [07:13:39] 10Analytics, 10Analytics-Kanban, 10Event-Platform: WMDEBanner* Event Platform Migration - https://phabricator.wikimedia.org/T282562 (10GoranSMilovanovic) @mforns For the New Editors campaign analytics we do not need those two fields. Let's hear from the WMDE FUN team if they do. [08:43:42] (03CR) 10Michael Große: [C: 04-1] "Ok, so based on the conversation in T281356, it seems we have now two ways forward with this patch:" [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/692259 (https://phabricator.wikimedia.org/T281356) (owner: 10Ladsgroup) [08:53:30] 10Analytics-Clusters, 10Analytics-Kanban, 10DBA, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10Marostegui) I have had to remove the ipv6 dns due to: T270101 [09:02:10] 10Analytics-Clusters, 10Analytics-Kanban, 10DBA, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10Marostegui) @razzi the data is cloned, however the host cannot reach any of the masters, I guess there are some FW/VLAN rules that need changing? I am checkin... [09:02:48] 10Analytics-Clusters, 10Analytics-Kanban, 10DBA, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10Marostegui) ` root@dbstore1006:/srv# telnet db1122.eqiad.wmnet 3306 Trying 10.64.48.34... ^C root@dbstore1006:/srv# telnet db1123.eqiad.wmnet 3306 Trying 10.6... 
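The -failover argument order explains the mix-up above: `hdfs haadmin -failover` takes the currently active NameNode's service ID first and the node to promote second, so re-running the identical command does not swap direction. A minimal sketch of checking state around the failover, using the service IDs from the log (wrapping the read-only check in kerberos-run-command is an assumption; the failover line is the one actually logged):

```
# Which NameNode is active right now? Service IDs as used in the !log entries above.
sudo -u hdfs kerberos-run-command hdfs hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
sudo -u hdfs kerberos-run-command hdfs hdfs haadmin -getServiceState an-master1002-eqiad-wmnet

# -failover <node to demote> <node to promote>: this one makes an-master1002 the active NN.
sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet
```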
[09:03:23] (03PS2) 10Ladsgroup: Make recent_changes_by_namespace track all namespaces [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/692259 (https://phabricator.wikimedia.org/T281356) [09:03:45] (03CR) 10Ladsgroup: "> Patch Set 1:" [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/692259 (https://phabricator.wikimedia.org/T281356) (owner: 10Ladsgroup) [09:04:15] (03CR) 10jerkins-bot: [V: 04-1] Make recent_changes_by_namespace track all namespaces [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/692259 (https://phabricator.wikimedia.org/T281356) (owner: 10Ladsgroup) [09:12:43] (03PS3) 10Ladsgroup: Make recent_changes_by_namespace track all namespaces [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/692259 (https://phabricator.wikimedia.org/T281356) [09:30:51] (03CR) 10Ladsgroup: "Tested, works fine: https://grafana.wikimedia.org/d/000000162/wikidata-site-stats?viewPanel=23&orgId=1" [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/692259 (https://phabricator.wikimedia.org/T281356) (owner: 10Ladsgroup) [11:35:01] 10Analytics-Clusters, 10Analytics-Kanban, 10DBA, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10Marostegui) @razzi @elukey @Ottomata I am afraid we need to re-do all this work. I just noticed that db1125 isn't the standard HW we have, but one of the old... [11:36:31] 10Analytics-Clusters: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10hnowlan) a:05hnowlan→03None [12:20:20] 10Analytics-Clusters, 10Analytics-Kanban, 10DBA, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10Marostegui) I have taken care of all the stuff from our side, so db1183 is now ready to be reimaged at your convenience. Let me know if you want me to decommi... [13:28:06] hellooo [13:29:34] Hi mforns :) [13:29:38] ottomata: you there? [13:29:49] hiya ya! [13:29:54] Hi! [13:30:07] ottomata: do you have minute to talk about revert for c*3? [13:30:30] sure [13:30:42] 10Analytics, 10Analytics-Kanban, 10Event-Platform: WMDEBanner* Event Platform Migration - https://phabricator.wikimedia.org/T282562 (10mforns) Thanks @GoranSMilovanovic! Yes, will wait for their thoughts. [13:30:47] cave! [13:31:59] coming internet slow and have to re-sign in [14:14:14] elukey: looking at the systemd multi instance stuff [14:14:41] from what I can tell, systemd::service doesn't quite have the ability to re-use systemd unit @ files [14:14:45] at least, in the examples I can find [14:14:56] 10Analytics: NullPointerException at beginning of spark job - https://phabricator.wikimedia.org/T278451 (10fkaelin) 05Open→03Resolved a:03fkaelin Apologies for the delay - I haven't been running larger avro based jobs recently, and I wasn't able to find a minimal example when I created this task. I am clos... [14:14:59] because systemd::service requires a content param [14:15:09] so every declarration will render a new .service file [14:15:32] ideally we could pre-render e.g. airflow@.service [14:15:40] with the %i part templated as the airflow instance name [14:16:02] and just start /stop by instance name using systemd @ template [14:16:08] right? 
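A hypothetical sketch of the systemd template-unit idea discussed here: one `airflow@.service` file (the name comes from the chat), with `%i` expanding to the instance name at start time. The unit contents, paths, and instance name below are illustrative assumptions, not the actual puppet-rendered service:

```
# One shared template unit; %i is replaced by whatever follows the "@".
sudo tee /etc/systemd/system/airflow@.service <<'EOF' >/dev/null
[Unit]
Description=Airflow scheduler for instance %i

[Service]
User=airflow
Environment=AIRFLOW_HOME=/srv/airflow/%i
ExecStart=/usr/bin/airflow scheduler
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl start airflow@analytics.service    # instance name "analytics"
sudo systemctl stop 'airflow@*.service'           # wildcard all loaded instances at once
```

The duplicate .service files discussed next are exactly what a single shared template like this would avoid.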
[14:16:17] it'll still work the same way wiith multiple .service files [14:16:25] they'll just be duplicates of each other [14:29:17] sure but the @name is the native way that we use in systemd to support multi instance [14:29:57] right ok, just checking [14:30:08] IIRC one example should be the kafka burrow stuff [14:30:14] the @name will work fine. but we just miss the benefit of not having to render multiple systemd .sevice files [14:30:22] they'll all be identical though [14:30:59] I think it is fine even if @name then [14:31:06] elukey: burrow doesn't use @name [14:31:08] just looked [14:31:12] eventlogging and prometheus do though [14:31:15] i'm borrowing from those [14:31:18] the prometheus exporters do yes, not the service units [14:31:31] yes yes but I think it is fine even without it if you want [14:31:34] they do acutally! prometheus server [14:31:55] $service_name = "prometheus@${title}" [14:32:11] elukey fine with out it == ? [14:34:27] razzi: o/ reporting for duty! :) [14:34:37] ottomata: I mean if you want to skip the @ it is fine for me :) [14:34:41] oh oh [14:34:43] no i like it elukey [14:34:46] Hi, good morning, I realized when I sent the maintenance email I said we wouldn't start for another 30 minutes [14:34:49] it makes it easier to wildcard and shut things down as needed [14:35:01] just was checking that i wasn't missing somethign about the duplicate .service files [14:35:09] oh ok! [14:35:11] razzi: hi! We can start draining the cluster in the meantime [14:35:20] without applying the yarn patch [14:35:31] yeah I guess that only affects us, good thinking [14:36:18] it will take time to good to start with that [14:37:13] Ok, get ready for a log of !logs [14:37:58] Any particularly destructive step I'll confirm before running, but for starters just disabling puppet and timers on an-launcher [14:39:16] !log stop puppet on an-launcher and stop hadoop-related timers [14:39:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:39:30] k [14:40:33] 10Analytics: Add ignore success flags option to pageview monthly dumps - https://phabricator.wikimedia.org/T283593 (10fdans) [14:46:06] razzi: you missed some timers :) [14:46:27] check systemd list-timers [14:51:52] I just stopped eventlogging_*.timer and monitor_*.timer (so we can save time later on) [14:51:55] Let's see... 
the monitor_refine probably don't need to be running, since nothing is going to be refined [14:52:13] 10Analytics, 10Analytics-Kanban: Change routing to accept a list of wikis in URL - https://phabricator.wikimedia.org/T283596 (10fdans) [14:52:23] ah and drop_event.timer [14:52:37] razzi: yes but better to stop all of them [14:54:57] (03Abandoned) 10Hnowlan: Update aqs to 60c2b70 [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/679333 (owner: 10Hnowlan) [14:55:39] judging from https://yarn.wikimedia.org/cluster/apps/RUNNING there are some things that may need a follow up while we wait [14:55:50] for example, search still runs a flink cluster [14:56:08] pretty sure that is just for devleopment, be good to ask them to shut it down [14:56:11] but if they don't respond i think we can [14:56:25] ping dcausse ^ [14:57:04] reading backlog [14:57:39] I can shutdown flink indeed [14:57:56] no problem to kill it as well, I restart it when it crashes [14:58:33] it's being used to test the pipeline in pre-prod (https://query-preview.wikidata.org) [14:59:27] <3 [14:59:34] joal: q, before I go about revert + stopping those c3 jobs (not doing today) [14:59:47] isi there an eta on fix? I guess a while right, lots of testing for new spark loading? [15:01:46] elukey: should be gone [15:01:52] Ok we're officially in the maintenance window, and there are 14 applications running [15:03:21] there are 2 analytics ones I think we can stop no problem [15:03:37] The other ones seem to be a mix of research and product analytics users [15:12:22] oh there is a gmodena flink! :) [15:12:24] gmodena: ok to stop? [15:12:40] razzi: all the wmfdata-yarn spark jobs are likely from jupyter notebooks [15:12:45] should be fine to stop [15:12:57] not as sure about wdqs-analysis or pyspark regular; misalignment [15:13:07] but, i think we are in the announced window so we should stop them [15:13:53] tanny411: ^ [15:15:49] Ahh, its ok to stop. I'll start again later then. [15:16:10] razzi: let's not stop any analytics job, otherwise we'll have to re-run them [15:16:47] they should finish in a bit, no more timers scheduled [15:18:39] btw, wdqs-analysis one was mine. When can I re-start? [15:18:54] Ok, I'll give them 10 minutes :) not too bad to have to re-run in my opinion [15:19:35] tanny411: cluster should be accepting jobs again in ~90 minutes [15:19:45] razzi: great, thanks! [15:19:57] razzi: lets wait for that analytics job if we can [15:20:00] its actually just one [15:20:06] one of the apps is the oozie launcher [15:20:10] it i a pageview monthly dump [15:20:24] fdans: is this part of a big backfill? [15:20:33] hive2:W=pageview-monthly_dump-wf-2014-11 [15:20:52] i think itis almost done [15:20:53] maybe. [15:20:57] maps are done [15:21:00] final reduce is finishing [15:21:00] https://yarn.wikimedia.org/proxy/application_1620304990193_87521/mapreduce/job/job_1620304990193_87521 [15:21:40] ottomata: it should be close to done, sorry, did I ignore scheduled maintenance? 
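A sketch of the an-launcher1002 draining step described above, using the timer globs mentioned in the chat; disabling puppet first keeps it from restarting the timers (the disable reason string is an assumption):

```
sudo puppet agent --disable "hadoop masters maintenance"

# Anything still scheduled to fire?
systemctl list-timers

# Timers named in the chat; per elukey, better to stop all of them.
sudo systemctl stop 'eventlogging_*.timer' 'monitor_*.timer' drop_event.timer
```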
[15:22:11] no no [15:22:14] i twas probably launched before [15:22:17] fdans: how dare you [15:22:27] the scheduled maintanence started 20 mins ago [15:22:31] :D [15:27:23] i cannot remeemeber how to set up custom icinga checks [15:27:37] wait i'm going to go say that in -sre [15:29:00] razzi: I think that we can proceed with https://gerrit.wikimedia.org/r/c/operations/puppet/+/692465, merge + run puppet on an-masters (without restarts of refresh queues) [15:29:06] so we'll be ready when the cluster is drained [15:29:12] Sounds good [15:32:01] !log disable puppet on an-master1001.eqiad.wmnet and an-master1002.eqiad.wmnet [15:32:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:32:24] razzi: it is fine to run puppet, it will not restart anything [15:32:31] but it will update the capacity scheduler's config [15:32:41] so we'll be ready to refresh queues when needed [15:32:57] oh right, I did this out of order, have to run puppet before disabling it [15:35:13] !log re-enable puppet on an-masters, run puppet, and sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues [15:35:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:36:18] !log disable puppet on an-master1001.eqiad.wmnet and an-master1002.eqiad.wmnet again [15:36:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:39:40] just tried spark2-shell --master yarn from stat1004 [15:39:41] org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1620304990193_87525 to YARN : org.apache.hadoop.security.AccessControlException: Queue root.default is STOPPED. Cannot accept submission of application: application_1620304990193_87525 [15:39:48] looks good [15:41:27] (maintenance time) [15:41:27] cool [15:44:45] razzi: ok next steps? [15:45:39] After the cluster is empty, enable safe mode. Still have 11 running applications [15:46:09] yes but those are notebooks/flink, that most of the time are long lived ones (even if they are not doing anything) [15:46:26] for example a lot of people keep their notebook running even if they are not executing queries etc.. [15:46:28] Oh ok, so we don't even have to stop them? [15:47:16] we can in theory avoid to stop them, since safe mode will be ok in theory, but I see that gmodena's flink cluster seems to have a few things running (not sure if it is a problem or not) [15:48:04] to be on the safe side, let's kill the apps [15:48:15] yeah +1 to killing them [15:48:48] yarn application -kill $appid [15:48:55] razzi: --6 [15:48:57] --^ [15:49:02] then we are free to start [15:49:16] ok, killing the remaining jobs [15:51:28] Cluster is empty, enabling safe mode [15:51:52] !log enable safe mode on an-master1001 with sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter [15:51:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:52:21] !log checkpoint hdfs with sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace [15:52:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:54:59] Hm "Save namespace failed for an-master1001.eqiad.wmnet/10.64.5.26:8020" [15:55:13] did it say why? 
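The commands behind the drain-and-checkpoint !log entries above, roughly in order; the `-appStates` listing flag is an assumption about how the remaining applications were enumerated, the rest is as logged:

```
# Anything still running on YARN?
yarn application -list -appStates RUNNING

# Kill the stragglers (idle notebooks, the dev flink session), one application id at a time.
yarn application -kill $appid

# With the cluster quiet, freeze HDFS writes and checkpoint the namespace.
sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace    # the step that failed here
```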
[15:56:10] razzi: more info the in hdfs-namenode log [15:56:27] > saveNamespace: End of File Exception between local host is: "an-master1001/10.64.5.26"; destination host is: "an-master1001.eqiad.wmnet":8020; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException [15:56:37] > Save namespace successful for an-master1002.eqiad.wmnet/10.64.21.110:8020 [15:56:41] huh [15:56:46] the quorum that is mentioned is the journal manager one [15:56:55] I'd retry again [15:57:20] re-ran, failed again [15:58:27] PROBLEM - Hadoop Namenode - Primary on an-master1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [15:58:52] ooff [15:59:25] an-master1002 is active, hadoop-hdfs-namenode stopped on an-master1001 [15:59:44] Hooray for HA [15:59:59] elukey sure! [16:00:25] razzi: what would you do now? [16:00:32] gmodena: already killed np :) [16:01:36] I'd like to get an-master1001 hadoop-hdfs-namenode back up an running, reading the logs trying to figure out what's happening [16:03:11] sorry to be late folks - I can help from now on [16:03:27] there are multiple options - this seems to be a failure in one of the namenodes, so we can try to start it again [16:04:36] (03PS1) 10Mforns: Add dropped partitions and deleted directory size limits [analytics/refinery] - 10https://gerrit.wikimedia.org/r/694547 (https://phabricator.wikimedia.org/T270433) [16:05:00] as far as I can see it failed to achieve quorum with the journal nodes, and then it decided to shutdown [16:05:33] since we have only one namenode up now, we should try to get 1001 back up and running and see if it bootstraps cleanly [16:05:40] RECOVERY - Hadoop Namenode - Primary on an-master1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [16:05:46] \o/ [16:06:02] !log sudo systemctl restart hadoop-hdfs-namenode [16:06:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:06:06] elukey summons namenodes - this man is inbelievable [16:06:11] :) [16:06:55] well razzi is ordering to namenodes what to do this time, not me :) [16:07:20] Indeed! thank you razzi :) [16:08:19] https://grafana.wikimedia.org/d/000000585/hadoop?viewPanel=26&orgId=1 is something that happens when the namenode is restarted [16:08:26] we need to wait for a complete recovery [16:08:42] safe mode is still on since 1002 is the master, so there are some garbage in the logs [16:08:57] also get service state for 1001 needs to come back clean [16:09:53] the namenode is not really great in recovering from journal node corner cases [16:12:36] ok so we can try to fail back in ~10 mins when things are settled, and see if any metric looks weird (blocks etc..) [16:12:49] then we should still be in safe mode, and we can re-attempt a save namespace [16:13:04] if it doesn't work, we should just leave maintenance and understand what's happening [16:13:15] it may be a new bug/behavior of 2.10 [16:13:28] (hopefully not, I am inclined to bet on a temporary glitch) [16:15:00] razzi: does it make sense --^ ? 
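A sketch of the recovery-and-failback sequence agreed on here, run on an-master1001 except where noted; the commands and log path are the ones quoted in the chat, and the `tail -f` is just one way to watch the restart:

```
# Bring the stopped NameNode back and watch it rejoin as standby.
sudo systemctl restart hadoop-hdfs-namenode
tail -f /var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master1001.log

# Safe mode was blocking the edit-log roll on the standby, so lift it first.
sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave

# Once metrics and service state settle, fail back: demote 1002, promote 1001.
sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
```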
[16:16:16] Yes, watching metrics and waiting for things to settle [16:16:23] 10Analytics-Kanban: Request for Kerberos password for kzimmerman - https://phabricator.wikimedia.org/T283386 (10Ottomata) a:03Ottomata [16:17:22] also logs [16:17:49] an-master1001 is standby now, which is good [16:18:48] journalctl -u hadoop-hdfs-namenode on an-master1001 strangely has had no logs for 12 minutes, last line was "an-master1001 systemd[1]: Started LSB: Hadoop namenode.", I guess it's still starting [16:20:23] razzi: the logs are in /var/log/hadoop-yarn/yarn-yarn-resourcemanager-an-master1001.log [16:20:35] err [16:20:40] activity looks flat currently [16:21:04] /var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master1001.log [16:21:25] 2021-05-25 16:20:57,550 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unable to trigger a roll of the active NN [16:21:28] java.util.concurrent.ExecutionException: java.io.IOException: Cannot find any valid remote NN to service request! [16:21:31] this is not great [16:21:55] ah yes because of safe mode [16:22:05] ok we should turn safe mode off in my opinion [16:22:12] just to allow 1001 to fully recover [16:22:42] sounds good, will do [16:23:56] !log sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave [16:23:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:24:16] 10Analytics-Kanban: Request for Kerberos password for kzimmerman - https://phabricator.wikimedia.org/T283386 (10Ottomata) Ok! You should receive an email asking you to login and create a password. After you do that, you can close this ticket. Thank you! [16:24:19] perfect logs look good [16:25:36] so I think that we can try the failback [16:25:43] and wait 5 mins to check metrics [16:27:37] razzi: --^ [16:28:05] ok, will failover 1002 to 1001 [16:28:18] !log sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet [16:28:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:28:23] > Failover to NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040 successful [16:28:39] (I see the word "fail" and worry, but it's good :) [16:29:12] metrics are good but let's wait a couple of minutes [16:30:21] I see gc activity for 1001, but it must be housekeeping as active [16:30:35] gc time is spiking for 1001, yeah [16:31:27] a-team retro? [16:32:09] I'm going to focus on hadoop for the moment fdans [16:34:09] ottomata, razzi - I need to run in ~30 mins for dinner, didn't expect a longer maintenance, we can decide what to do now [16:34:32] ack [16:34:33] I think that when we pass the backup fsimage step it should be relatively easy [16:35:14] razzi: ok so let's retry safemode + save namespace, this time let's wait a couple of minutes between steps [16:35:25] sounds good, here goes [16:35:33] !log sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter [16:35:35] it shouldn't matter but we'll see [16:35:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:38:09] razzi: let's go with save namespace [16:38:39] !log sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace [16:38:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:40:21] nope [16:40:26] failed again [16:40:32] :( [16:40:37] :( indeed [16:40:39] journalnode problem? [16:41:41] razzi: let's redo the procedure to start the namenode on 1001, wait a bit, failback, etc.. 
and call this maintenance over [16:42:21] yeah, good to get back to a known good state [16:42:22] joal: I suspect it may be a heap size issue, there is a thread dump plus error messages related to quorum failures for journal [16:43:34] !log sudo systemctl restart hadoop-hdfs-namenode on an-master1001 [16:43:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:45:34] elukey: as in for instance heap-size being too big to get all the data from journal and therefore not reaching quorum? [16:46:41] joal: it smells as if the namenode uses the heap size to compute the fsimage, fetching data from the edit log etc.. but somehow failing to do it [16:46:50] hm [16:47:10] it is strange since it recovers nicely after a restart [16:47:18] without any inconsistency [16:47:35] and I see a thread dump when the savenamespace is issued [16:49:51] Could it be that the journal nodes have inconsistencies that don't show up except in fsimage generation? [16:50:39] the fsimage gets saved on the standby nn IIUC [16:50:43] but not in the primary [16:50:45] right razzi ? [16:50:52] yes indeed [16:51:03] also I'd expect more weirdnesses if the edit log was inconsistent [16:51:23] also we should have timer that triggers the fs image generation [16:52:45] yes on 1002, it does '/usr/bin/hdfs', 'dfsadmin', '-fetchImage' etc.. [16:52:49] but no failures in lgos [16:52:50] *logs [16:53:06] that's weird [16:53:21] Could we try to generate the image on 1 when 2 is masteR? [16:53:50] if you want to do it +1 but I am going out for dinner in 5 mins :) [16:53:58] ack elukey :) [16:54:21] * joal is not root and doesn't suggest weird stuff :) [16:55:38] an-master1001 is up as standby [16:55:44] metrics look healthy [16:56:49] razzi: I agree, wait 5 mins just in case and then failback [16:57:05] if you can open a task collecting info it would be great [16:57:11] but I bet on heap size [16:57:27] I'd just bump the heap size to the next step that we have in our docs [16:57:33] (ahead of time) [16:57:39] but those are my 2c :) [16:58:00] we can also test if savenamespace leads to problems in hadoop test [16:59:15] good ideas, thanks for your support this time around, draining the cluster wasn't too bad, I'm sure we can do so again soon [16:59:43] there still is some activity on an-master1002 (sent bytes + rpc calls) - I wonder if it's expected [17:03:26] !log sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet [17:03:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:04:52] !log sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave [17:04:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:07:59] !log re-enabled puppet on an-masters and an-launcher [17:08:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:08:55] Ok, the cluster is back up and running, jobs are running again [17:09:14] Will gather logs on fsimage error and write up in task [17:09:33] * elukey afk! [17:10:10] I will monitor jobs on the cluster razzi - thanks for resetting ! [17:16:14] razzi: dumb question - Have you reset the yarn queue to accept jobs\/ [17:16:25] ? [17:18:57] ok I confirm this has not been done - ping razzi or ottomata :) [17:19:22] Oops! 
Yeah let me do that [17:19:31] thanks razzi :) [17:19:44] Smar question :) [17:19:48] 'smart [17:21:23] PROBLEM - Check unit status of refine_eventlogging_legacy on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:27:36] ^ that is because the yarn queue is not accepting [17:28:08] !log sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues to enable submitting jobs once again [17:28:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:28:46] !log sudo systemctl restart refine_eventlogging_legacy [17:28:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:30:43] Still not working razzi - Have restarted the ResourceManager? [17:31:07] hum - the refreshQueue should have been enough [17:31:23] PROBLEM - Check unit status of refine_event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:33:08] 10Analytics-Radar, 10SRE, 10ops-eqiad: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10Cmjohnson) @elukey I have this on my plan for tomorrow morning. i'll update the task once the move is complete. [17:34:06] oh, I bet I have to set status to STARTED, just removing STOPPED isn't enough [17:34:15] PROBLEM - Check unit status of refine_eventlogging_analytics on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:41:00] Ah! interesting razzi [17:50:59] Hm, that didn't fix it [17:52:55] !log sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues on an-master1001 and an-master1002 [17:52:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:01:32] !log manually edit /etc/hadoop/conf/capacity-scheduler.xml to make queues running and sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues [18:01:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:04:01] Ok, production queue is accepting again [18:04:12] \o/ [18:04:23] PROBLEM - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:04:29] Thanks razzi - How have you managed to make it work? 
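For context on why the earlier refreshQueues attempts did not help: the capacity scheduler keeps a per-queue state in /etc/hadoop/conf/capacity-scheduler.xml (valid values are RUNNING and STOPPED), so each queue has to be flipped back explicitly before a refresh takes effect. A sketch, with queue names taken from the fix razzi describes next:

```
# One such property per queue in capacity-scheduler.xml, e.g.:
#
#   <property>
#     <name>yarn.scheduler.capacity.root.default.state</name>
#     <value>RUNNING</value>
#   </property>
#
# (likewise for root.fifo, root.production, root.essential), then push it to the ResourceManager:
sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues

# Confirm every queue is back to a running state:
mapred queue -list
```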
[18:04:43] PROBLEM - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:06:15] PROBLEM - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:06:25] RECOVERY - Check unit status of refine_eventlogging_analytics on an-launcher1002 is OK: OK: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:06:39] Had to add a bunch of lines to /etc/hadoop/conf/capacity-scheduler.xml, like [18:06:39] ``` [18:06:39] <property> [18:06:39] <name>yarn.scheduler.capacity.root.fifo.state</name> [18:06:39] <value>RUNNING</value> [18:06:39] </property> [18:06:39] ``` [18:06:40] for fifo,default,production,essential [18:07:03] I may have missed some, wish there was a --recursive or something [18:08:36] ok, based on `mapred queue -list`, that's all of them [18:08:58] man that's uncool [18:09:08] ok, next time we'll stop the default queue only! [18:09:13] razzi: --^ [18:11:57] Ok maintenance is officially over, yarn is accepting, hdfs is writeable, alas an-masters still run stretch [18:12:32] \o/ [18:12:33] PROBLEM - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:14:49] !log sudo systemctl start eventlogging_to_druid_navigationtiming_hourly.service [18:14:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:15:07] RECOVERY - Check unit status of refine_eventlogging_legacy on an-launcher1002 is OK: OK: Status of the systemd unit refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:15:40] nice razzi !!!
:) [18:16:49] !log sudo systemctl start all failed units from `systemctl list-units --state=failed` on an-launcher1002 [18:16:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:16:57] RECOVERY - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:22:59] RECOVERY - Check unit status of drop_event on an-launcher1002 is OK: OK: Status of the systemd unit drop_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:23:13] RECOVERY - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:25:03] RECOVERY - Check unit status of refine_event on an-launcher1002 is OK: OK: Status of the systemd unit refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:25:50] RECOVERY - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:26:11] RECOVERY - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:30:23] * razzi cheers at the recoveries [18:30:35] * razzi lunchtime [18:38:35] PROBLEM - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:40:05] PROBLEM - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:46:21] PROBLEM - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:48:57] PROBLEM - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:51:59] (03PS1) 10Fdans: Alter routing logic to allow value lists [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/694634 (https://phabricator.wikimedia.org/T283596) [18:53:25] (03PS2) 10Fdans: Alter routing logic to allow value lists [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/694634 (https://phabricator.wikimedia.org/T283596) [19:01:31] RECOVERY - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:07:49] RECOVERY - Check unit status of eventlogging_to_druid_navigationtiming_hourly on 
an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:09:29] 10Analytics, 10Analytics-Kanban: Change state to store project as an array - https://phabricator.wikimedia.org/T283624 (10fdans) [19:10:23] RECOVERY - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:10:43] RECOVERY - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:14:09] PROBLEM - Hadoop NodeManager on analytics1068 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:33:13] RECOVERY - Hadoop NodeManager on analytics1068 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:00:07] razzi: am available lemme know if you are back from lunch and ready to meet [20:01:29] ottomata: sounds good, ready to meet in 2 mins [20:01:35] k bc? [20:04:29] yep! [20:04:32] ottomata: [20:04:46] k [20:27:28] I trust: urbanecm!.*@user/urbanecm (2admin), .*@user/urbanecmbackup/x-3733651 (2admin), [20:27:28] @trusted [20:27:44] Successfully added .*@wikimedia/Martin-Urbanec [20:27:44] @trustadd .*@wikimedia/Martin-Urbanec admin [20:31:42] You are admin and identified by the name .*@wikimedia/Martin-Urbanec [20:31:42] @whoami [20:31:58] User was deleted from access list [20:31:58] @trustdel urbanecm!.*@user/urbanecm [20:32:03] I trust: .*@user/urbanecmbackup/x-3733651 (2admin), .*@wikimedia/Martin-Urbanec (2admin), [20:32:03] @trusted [20:50:08] 10Analytics, 10EventStreams, 10Patch-Needs-Improvement, 10Services (watching): EventStreams process occasionally OOMs - https://phabricator.wikimedia.org/T210741 (10Aklapper) [20:56:10] 10Analytics, 10EventStreams, 10Patch-Needs-Improvement, 10Services (watching): EventStreams process occasionally OOMs - https://phabricator.wikimedia.org/T210741 (10Ottomata) 05Open→03Declined [21:07:20] 10Analytics-Radar, 10LDAP-Access-Requests, 10SRE, 10SRE-Access-Requests: Account setup issues for jmixter-ctr - https://phabricator.wikimedia.org/T283250 (10jmixter) yeah sorry about that. I think this was a symptom of me being new and not having any idea what I was doing. I think things are resolved now. [21:10:12] (03PS5) 10Aklapper: Create simple CLI management tool [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/512614 (https://phabricator.wikimedia.org/T224376) (owner: 10Framawiki) [21:21:02] 10Analytics, 10Discovery, 10Event-Platform, 10Product-Data-Infrastructure, and 2 others: Avoid accepting Kafka messages with whacky timestamps - https://phabricator.wikimedia.org/T282887 (10BPirkle) [23:53:49] 10Analytics-Radar, 10LDAP-Access-Requests, 10SRE, 10SRE-Access-Requests: Account setup issues for jmixter-ctr - https://phabricator.wikimedia.org/T283250 (10Dzahn) 05Open→03Resolved a:03Dzahn @jmixter Cool, great to hear that things work for you now and thanks for confirming. 
I think the wiki editin...