[00:34:40] !log sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet [00:34:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [00:35:42] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Better Use Of Data, and 4 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (10Ottomata) [01:07:47] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Upgrade the Hadoop masters to Debian Buster - https://phabricator.wikimedia.org/T278423 (10razzi) I was able to failover using `sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet`, everything seemed to... [01:07:52] !log sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet [01:07:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [01:08:22] Oops I meant to do that the other way this time [01:08:28] Not a problem, 1002 is still active [01:08:42] (I should be a bit more careful saying "oops" when it comes to hdfs namenode) [01:09:40] !log sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet [01:09:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [04:29:18] 10Analytics-Clusters, 10Analytics-Kanban, 10DBA, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10Marostegui) Transfer started. [06:13:21] 10Analytics, 10Analytics-Wikistats: Wikistats New Feature - https://phabricator.wikimedia.org/T283562 (10MS.NIMO) [06:21:52] Good morning [06:27:28] bonjour [06:33:26] 10Analytics-Clusters, 10Analytics-Kanban, 10DBA, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10elukey) @razzi I forgot to mention that DNS CNAME/SRV records are also to update, otherwise the various tools that we use will not work: ` templates/wmnet:s2... [07:13:39] 10Analytics, 10Analytics-Kanban, 10Event-Platform: WMDEBanner* Event Platform Migration - https://phabricator.wikimedia.org/T282562 (10GoranSMilovanovic) @mforns For the New Editors campaign analytics we do not need those two fields. Let's hear from the WMDE FUN team if they do. [08:43:42] (03CR) 10Michael Große: [C: 04-1] "Ok, so based on the conversation in T281356, it seems we have now two ways forward with this patch:" [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/692259 (https://phabricator.wikimedia.org/T281356) (owner: 10Ladsgroup) [08:53:30] 10Analytics-Clusters, 10Analytics-Kanban, 10DBA, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10Marostegui) I have had to remove the ipv6 dns due to: T270101 [09:02:10] 10Analytics-Clusters, 10Analytics-Kanban, 10DBA, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10Marostegui) @razzi the data is cloned, however the host cannot reach any of the masters, I guess there are some FW/VLAN rules that need changing? I am checkin... [09:02:48] 10Analytics-Clusters, 10Analytics-Kanban, 10DBA, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10Marostegui) ` root@dbstore1006:/srv# telnet db1122.eqiad.wmnet 3306 Trying 10.64.48.34... ^C root@dbstore1006:/srv# telnet db1123.eqiad.wmnet 3306 Trying 10.6... 
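The -failover argument order explains the mix-up above: `hdfs haadmin -failover` takes the currently active NameNode's service ID first and the node to promote second, so re-running the identical command does not swap direction. A minimal sketch of checking state around the failover, using the service IDs from the log (wrapping the read-only check in kerberos-run-command is an assumption; the failover line is the one actually logged):

```
# Which NameNode is active right now? Service IDs as used in the !log entries above.
sudo -u hdfs kerberos-run-command hdfs hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
sudo -u hdfs kerberos-run-command hdfs hdfs haadmin -getServiceState an-master1002-eqiad-wmnet

# -failover <node to demote> <node to promote>: this one makes an-master1002 the active NN.
sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet
```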
[09:03:23] (03PS2) 10Ladsgroup: Make recent_changes_by_namespace track all namespaces [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/692259 (https://phabricator.wikimedia.org/T281356) [09:03:45] (03CR) 10Ladsgroup: "> Patch Set 1:" [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/692259 (https://phabricator.wikimedia.org/T281356) (owner: 10Ladsgroup) [09:04:15] (03CR) 10jerkins-bot: [V: 04-1] Make recent_changes_by_namespace track all namespaces [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/692259 (https://phabricator.wikimedia.org/T281356) (owner: 10Ladsgroup) [09:12:43] (03PS3) 10Ladsgroup: Make recent_changes_by_namespace track all namespaces [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/692259 (https://phabricator.wikimedia.org/T281356) [09:30:51] (03CR) 10Ladsgroup: "Tested, works fine: https://grafana.wikimedia.org/d/000000162/wikidata-site-stats?viewPanel=23&orgId=1" [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/692259 (https://phabricator.wikimedia.org/T281356) (owner: 10Ladsgroup) [11:35:01] 10Analytics-Clusters, 10Analytics-Kanban, 10DBA, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10Marostegui) @razzi @elukey @Ottomata I am afraid we need to re-do all this work. I just noticed that db1125 isn't the standard HW we have, but one of the old... [11:36:31] 10Analytics-Clusters: Refresh Druid nodes (druid100[1-3]) - https://phabricator.wikimedia.org/T255148 (10hnowlan) a:05hnowlan→03None [12:20:20] 10Analytics-Clusters, 10Analytics-Kanban, 10DBA, 10Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (10Marostegui) I have taken care of all the stuff from our side, so db1183 is now ready to be reimaged at your convenience. Let me know if you want me to decommi... [13:28:06] hellooo [13:29:34] Hi mforns :) [13:29:38] ottomata: you there? [13:29:49] hiya ya! [13:29:54] Hi! [13:30:07] ottomata: do you have minute to talk about revert for c*3? [13:30:30] sure [13:30:42] 10Analytics, 10Analytics-Kanban, 10Event-Platform: WMDEBanner* Event Platform Migration - https://phabricator.wikimedia.org/T282562 (10mforns) Thanks @GoranSMilovanovic! Yes, will wait for their thoughts. [13:30:47] cave! [13:31:59] coming internet slow and have to re-sign in [14:14:14] elukey: looking at the systemd multi instance stuff [14:14:41] from what I can tell, systemd::service doesn't quite have the ability to re-use systemd unit @ files [14:14:45] at least, in the examples I can find [14:14:56] 10Analytics: NullPointerException at beginning of spark job - https://phabricator.wikimedia.org/T278451 (10fkaelin) 05Open→03Resolved a:03fkaelin Apologies for the delay - I haven't been running larger avro based jobs recently, and I wasn't able to find a minimal example when I created this task. I am clos... [14:14:59] because systemd::service requires a content param [14:15:09] so every declarration will render a new .service file [14:15:32] ideally we could pre-render e.g. airflow@.service [14:15:40] with the %i part templated as the airflow instance name [14:16:02] and just start /stop by instance name using systemd @ template [14:16:08] right? 
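A hypothetical sketch of the systemd template-unit idea discussed here: one `airflow@.service` file (the name comes from the chat), with `%i` expanding to the instance name at start time. The unit contents, paths, and instance name below are illustrative assumptions, not the actual puppet-rendered service:

```
# One shared template unit; %i is replaced by whatever follows the "@".
sudo tee /etc/systemd/system/airflow@.service <<'EOF' >/dev/null
[Unit]
Description=Airflow scheduler for instance %i

[Service]
User=airflow
Environment=AIRFLOW_HOME=/srv/airflow/%i
ExecStart=/usr/bin/airflow scheduler
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl start airflow@analytics.service    # instance name "analytics"
sudo systemctl stop 'airflow@*.service'           # wildcard all loaded instances at once
```

The duplicate .service files discussed next are exactly what a single shared template like this would avoid.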
[14:16:17] it'll still work the same way wiith multiple .service files [14:16:25] they'll just be duplicates of each other [14:29:17] sure but the @name is the native way that we use in systemd to support multi instance [14:29:57] right ok, just checking [14:30:08] IIRC one example should be the kafka burrow stuff [14:30:14] the @name will work fine. but we just miss the benefit of not having to render multiple systemd .sevice files [14:30:22] they'll all be identical though [14:30:59] I think it is fine even if @name then [14:31:06] elukey: burrow doesn't use @name [14:31:08] just looked [14:31:12] eventlogging and prometheus do though [14:31:15] i'm borrowing from those [14:31:18] the prometheus exporters do yes, not the service units [14:31:31] yes yes but I think it is fine even without it if you want [14:31:34] they do acutally! prometheus server [14:31:55] $service_name = "prometheus@${title}" [14:32:11] elukey fine with out it == ? [14:34:27] razzi: o/ reporting for duty! :) [14:34:37] ottomata: I mean if you want to skip the @ it is fine for me :) [14:34:41] oh oh [14:34:43] no i like it elukey [14:34:46] Hi, good morning, I realized when I sent the maintenance email I said we wouldn't start for another 30 minutes [14:34:49] it makes it easier to wildcard and shut things down as needed [14:35:01] just was checking that i wasn't missing somethign about the duplicate .service files [14:35:09] oh ok! [14:35:11] razzi: hi! We can start draining the cluster in the meantime [14:35:20] without applying the yarn patch [14:35:31] yeah I guess that only affects us, good thinking [14:36:18] it will take time to good to start with that [14:37:13] Ok, get ready for a log of !logs [14:37:58] Any particularly destructive step I'll confirm before running, but for starters just disabling puppet and timers on an-launcher [14:39:16] !log stop puppet on an-launcher and stop hadoop-related timers [14:39:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:39:30] k [14:40:33] 10Analytics: Add ignore success flags option to pageview monthly dumps - https://phabricator.wikimedia.org/T283593 (10fdans) [14:46:06] razzi: you missed some timers :) [14:46:27] check systemd list-timers [14:51:52] I just stopped eventlogging_*.timer and monitor_*.timer (so we can save time later on) [14:51:55] Let's see... 
the monitor_refine probably don't need to be running, since nothing is going to be refined [14:52:13] 10Analytics, 10Analytics-Kanban: Change routing to accept a list of wikis in URL - https://phabricator.wikimedia.org/T283596 (10fdans) [14:52:23] ah and drop_event.timer [14:52:37] razzi: yes but better to stop all of them [14:54:57] (03Abandoned) 10Hnowlan: Update aqs to 60c2b70 [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/679333 (owner: 10Hnowlan) [14:55:39] judging from https://yarn.wikimedia.org/cluster/apps/RUNNING there are some things that may need a follow up while we wait [14:55:50] for example, search still runs a flink cluster [14:56:08] pretty sure that is just for devleopment, be good to ask them to shut it down [14:56:11] but if they don't respond i think we can [14:56:25] ping dcausse ^ [14:57:04] reading backlog [14:57:39] I can shutdown flink indeed [14:57:56] no problem to kill it as well, I restart it when it crashes [14:58:33] it's being used to test the pipeline in pre-prod (https://query-preview.wikidata.org) [14:59:27] <3 [14:59:34] joal: q, before I go about revert + stopping those c3 jobs (not doing today) [14:59:47] isi there an eta on fix? I guess a while right, lots of testing for new spark loading? [15:01:46] elukey: should be gone [15:01:52] Ok we're officially in the maintenance window, and there are 14 applications running [15:03:21] there are 2 analytics ones I think we can stop no problem [15:03:37] The other ones seem to be a mix of research and product analytics users [15:12:22] oh there is a gmodena flink! :) [15:12:24] gmodena: ok to stop? [15:12:40] razzi: all the wmfdata-yarn spark jobs are likely from jupyter notebooks [15:12:45] should be fine to stop [15:12:57] not as sure about wdqs-analysis or pyspark regular; misalignment [15:13:07] but, i think we are in the announced window so we should stop them [15:13:53] tanny411: ^ [15:15:49] Ahh, its ok to stop. I'll start again later then. [15:16:10] razzi: let's not stop any analytics job, otherwise we'll have to re-run them [15:16:47] they should finish in a bit, no more timers scheduled [15:18:39] btw, wdqs-analysis one was mine. When can I re-start? [15:18:54] Ok, I'll give them 10 minutes :) not too bad to have to re-run in my opinion [15:19:35] tanny411: cluster should be accepting jobs again in ~90 minutes [15:19:45] razzi: great, thanks! [15:19:57] razzi: lets wait for that analytics job if we can [15:20:00] its actually just one [15:20:06] one of the apps is the oozie launcher [15:20:10] it i a pageview monthly dump [15:20:24] fdans: is this part of a big backfill? [15:20:33] hive2:W=pageview-monthly_dump-wf-2014-11 [15:20:52] i think itis almost done [15:20:53] maybe. [15:20:57] maps are done [15:21:00] final reduce is finishing [15:21:00] https://yarn.wikimedia.org/proxy/application_1620304990193_87521/mapreduce/job/job_1620304990193_87521 [15:21:40] ottomata: it should be close to done, sorry, did I ignore scheduled maintenance? 
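A sketch of the an-launcher1002 draining step described above, using the timer globs mentioned in the chat; disabling puppet first keeps it from restarting the timers (the disable reason string is an assumption):

```
sudo puppet agent --disable "hadoop masters maintenance"

# Anything still scheduled to fire?
systemctl list-timers

# Timers named in the chat; per elukey, better to stop all of them.
sudo systemctl stop 'eventlogging_*.timer' 'monitor_*.timer' drop_event.timer
```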
[15:22:11] no no [15:22:14] i twas probably launched before [15:22:17] fdans: how dare you [15:22:27] the scheduled maintanence started 20 mins ago [15:22:31] :D [15:27:23] i cannot remeemeber how to set up custom icinga checks [15:27:37] wait i'm going to go say that in -sre [15:29:00] razzi: I think that we can proceed with https://gerrit.wikimedia.org/r/c/operations/puppet/+/692465, merge + run puppet on an-masters (without restarts of refresh queues) [15:29:06] so we'll be ready when the cluster is drained [15:29:12] Sounds good [15:32:01] !log disable puppet on an-master1001.eqiad.wmnet and an-master1002.eqiad.wmnet [15:32:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:32:24] razzi: it is fine to run puppet, it will not restart anything [15:32:31] but it will update the capacity scheduler's config [15:32:41] so we'll be ready to refresh queues when needed [15:32:57] oh right, I did this out of order, have to run puppet before disabling it [15:35:13] !log re-enable puppet on an-masters, run puppet, and sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues [15:35:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:36:18] !log disable puppet on an-master1001.eqiad.wmnet and an-master1002.eqiad.wmnet again [15:36:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:39:40] just tried spark2-shell --master yarn from stat1004 [15:39:41] org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1620304990193_87525 to YARN : org.apache.hadoop.security.AccessControlException: Queue root.default is STOPPED. Cannot accept submission of application: application_1620304990193_87525 [15:39:48] looks good [15:41:27] (maintenance time) [15:41:27] cool [15:44:45] razzi: ok next steps? [15:45:39] After the cluster is empty, enable safe mode. Still have 11 running applications [15:46:09] yes but those are notebooks/flink, that most of the time are long lived ones (even if they are not doing anything) [15:46:26] for example a lot of people keep their notebook running even if they are not executing queries etc.. [15:46:28] Oh ok, so we don't even have to stop them? [15:47:16] we can in theory avoid to stop them, since safe mode will be ok in theory, but I see that gmodena's flink cluster seems to have a few things running (not sure if it is a problem or not) [15:48:04] to be on the safe side, let's kill the apps [15:48:15] yeah +1 to killing them [15:48:48] yarn application -kill $appid [15:48:55] razzi: --6 [15:48:57] --^ [15:49:02] then we are free to start [15:49:16] ok, killing the remaining jobs [15:51:28] Cluster is empty, enabling safe mode [15:51:52] !log enable safe mode on an-master1001 with sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter [15:51:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:52:21] !log checkpoint hdfs with sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace [15:52:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:54:59] Hm "Save namespace failed for an-master1001.eqiad.wmnet/10.64.5.26:8020" [15:55:13] did it say why? 
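The commands behind the drain-and-checkpoint !log entries above, roughly in order; the `-appStates` listing flag is an assumption about how the remaining applications were enumerated, the rest is as logged:

```
# Anything still running on YARN?
yarn application -list -appStates RUNNING

# Kill the stragglers (idle notebooks, the dev flink session), one application id at a time.
yarn application -kill $appid

# With the cluster quiet, freeze HDFS writes and checkpoint the namespace.
sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace    # the step that failed here
```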
[15:56:10] razzi: more info the in hdfs-namenode log [15:56:27] > saveNamespace: End of File Exception between local host is: "an-master1001/10.64.5.26"; destination host is: "an-master1001.eqiad.wmnet":8020; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException [15:56:37] > Save namespace successful for an-master1002.eqiad.wmnet/10.64.21.110:8020 [15:56:41] huh [15:56:46] the quorum that is mentioned is the journal manager one [15:56:55] I'd retry again [15:57:20] re-ran, failed again [15:58:27] PROBLEM - Hadoop Namenode - Primary on an-master1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [15:58:52] ooff [15:59:25] an-master1002 is active, hadoop-hdfs-namenode stopped on an-master1001 [15:59:44] Hooray for HA [15:59:59] elukey sure! [16:00:25] razzi: what would you do now? [16:00:32] gmodena: already killed np :) [16:01:36] I'd like to get an-master1001 hadoop-hdfs-namenode back up an running, reading the logs trying to figure out what's happening [16:03:11] sorry to be late folks - I can help from now on [16:03:27] there are multiple options - this seems to be a failure in one of the namenodes, so we can try to start it again [16:04:36] (03PS1) 10Mforns: Add dropped partitions and deleted directory size limits [analytics/refinery] - 10https://gerrit.wikimedia.org/r/694547 (https://phabricator.wikimedia.org/T270433) [16:05:00] as far as I can see it failed to achieve quorum with the journal nodes, and then it decided to shutdown [16:05:33] since we have only one namenode up now, we should try to get 1001 back up and running and see if it bootstraps cleanly [16:05:40] RECOVERY - Hadoop Namenode - Primary on an-master1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Namenode_process [16:05:46] \o/ [16:06:02] !log sudo systemctl restart hadoop-hdfs-namenode [16:06:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:06:06] elukey summons namenodes - this man is inbelievable [16:06:11] :) [16:06:55] well razzi is ordering to namenodes what to do this time, not me :) [16:07:20] Indeed! thank you razzi :) [16:08:19] https://grafana.wikimedia.org/d/000000585/hadoop?viewPanel=26&orgId=1 is something that happens when the namenode is restarted [16:08:26] we need to wait for a complete recovery [16:08:42] safe mode is still on since 1002 is the master, so there are some garbage in the logs [16:08:57] also get service state for 1001 needs to come back clean [16:09:53] the namenode is not really great in recovering from journal node corner cases [16:12:36] ok so we can try to fail back in ~10 mins when things are settled, and see if any metric looks weird (blocks etc..) [16:12:49] then we should still be in safe mode, and we can re-attempt a save namespace [16:13:04] if it doesn't work, we should just leave maintenance and understand what's happening [16:13:15] it may be a new bug/behavior of 2.10 [16:13:28] (hopefully not, I am inclined to bet on a temporary glitch) [16:15:00] razzi: does it make sense --^ ? 
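A sketch of the recovery-and-failback sequence agreed on here, run on an-master1001 except where noted; the commands and log path are the ones quoted in the chat, and the `tail -f` is just one way to watch the restart:

```
# Bring the stopped NameNode back and watch it rejoin as standby.
sudo systemctl restart hadoop-hdfs-namenode
tail -f /var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master1001.log

# Safe mode was blocking the edit-log roll on the standby, so lift it first.
sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave

# Once metrics and service state settle, fail back: demote 1002, promote 1001.
sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
```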
[16:16:16] Yes, watching metrics and waiting for things to settle [16:16:23] 10Analytics-Kanban: Request for Kerberos password for kzimmerman - https://phabricator.wikimedia.org/T283386 (10Ottomata) a:03Ottomata [16:17:22] also logs [16:17:49] an-master1001 is standby now, which is good [16:18:48] journalctl -u hadoop-hdfs-namenode on an-master1001 strangely has had no logs for 12 minutes, last line was "an-master1001 systemd[1]: Started LSB: Hadoop namenode.", I guess it's still starting [16:20:23] razzi: the logs are in /var/log/hadoop-yarn/yarn-yarn-resourcemanager-an-master1001.log [16:20:35] err [16:20:40] activity looks flat currently [16:21:04] /var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master1001.log [16:21:25] 2021-05-25 16:20:57,550 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unable to trigger a roll of the active NN [16:21:28] java.util.concurrent.ExecutionException: java.io.IOException: Cannot find any valid remote NN to service request! [16:21:31] this is not great [16:21:55] ah yes because of safe mode [16:22:05] ok we should turn safe mode off in my opinion [16:22:12] just to allow 1001 to fully recover [16:22:42] sounds good, will do [16:23:56] !log sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave [16:23:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:24:16] 10Analytics-Kanban: Request for Kerberos password for kzimmerman - https://phabricator.wikimedia.org/T283386 (10Ottomata) Ok! You should receive an email asking you to login and create a password. After you do that, you can close this ticket. Thank you! [16:24:19] perfect logs look good [16:25:36] so I think that we can try the failback [16:25:43] and wait 5 mins to check metrics [16:27:37] razzi: --^ [16:28:05] ok, will failover 1002 to 1001 [16:28:18] !log sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet [16:28:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:28:23] > Failover to NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8040 successful [16:28:39] (I see the word "fail" and worry, but it's good :) [16:29:12] metrics are good but let's wait a couple of minutes [16:30:21] I see gc activity for 1001, but it must be housekeeping as active [16:30:35] gc time is spiking for 1001, yeah [16:31:27] a-team retro? [16:32:09] I'm going to focus on hadoop for the moment fdans [16:34:09] ottomata, razzi - I need to run in ~30 mins for dinner, didn't expect a longer maintenance, we can decide what to do now [16:34:32] ack [16:34:33] I think that when we pass the backup fsimage step it should be relatively easy [16:35:14] razzi: ok so let's retry safemode + save namespace, this time let's wait a couple of minutes between steps [16:35:25] sounds good, here goes [16:35:33] !log sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter [16:35:35] it shouldn't matter but we'll see [16:35:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:38:09] razzi: let's go with save namespace [16:38:39] !log sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace [16:38:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:40:21] nope [16:40:26] failed again [16:40:32] :( [16:40:37] :( indeed [16:40:39] journalnode problem? [16:41:41] razzi: let's redo the procedure to start the namenode on 1001, wait a bit, failback, etc.. 
and call this maintenance over [16:42:21] yeah, good to get back to a known good state [16:42:22] joal: I suspect it may be a heap size issue, there is a thread dump plus error messages related to quorum failures for journal [16:43:34] !log sudo systemctl restart hadoop-hdfs-namenode on an-master1001 [16:43:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:45:34] elukey: as in for instance heap-size being too big to get all the data from journal and therefore not reaching quorum? [16:46:41] joal: it smells as if the namenode uses the heap size to compute the fsimage, fetching data from the edit log etc.. but somehow failing to do it [16:46:50] hm [16:47:10] it is strange since it recovers nicely after a restart [16:47:18] without any inconsistency [16:47:35] and I see a thread dump when the savenamespace is issued [16:49:51] Could it be that the journal nodes have inconsistencies that don't show up except in fsimage generation? [16:50:39] the fsimage gets saved on the standby nn IIUC [16:50:43] but not in the primary [16:50:45] right razzi ? [16:50:52] yes indeed [16:51:03] also I'd expect more weirdnesses if the edit log was inconsistent [16:51:23] also we should have timer that triggers the fs image generation [16:52:45] yes on 1002, it does '/usr/bin/hdfs', 'dfsadmin', '-fetchImage' etc.. [16:52:49] but no failures in lgos [16:52:50] *logs [16:53:06] that's weird [16:53:21] Could we try to generate the image on 1 when 2 is masteR? [16:53:50] if you want to do it +1 but I am going out for dinner in 5 mins :) [16:53:58] ack elukey :) [16:54:21] * joal is not root and doesn't suggest weird stuff :) [16:55:38] an-master1001 is up as standby [16:55:44] metrics look healthy [16:56:49] razzi: I agree, wait 5 mins just in case and then failback [16:57:05] if you can open a task collecting info it would be great [16:57:11] but I bet on heap size [16:57:27] I'd just bump the heap size to the next step that we have in our docs [16:57:33] (ahead of time) [16:57:39] but those are my 2c :) [16:58:00] we can also test if savenamespace leads to problems in hadoop test [16:59:15] good ideas, thanks for your support this time around, draining the cluster wasn't too bad, I'm sure we can do so again soon [16:59:43] there still is some activity on an-master1002 (sent bytes + rpc calls) - I wonder if it's expected [17:03:26] !log sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet [17:03:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:04:52] !log sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave [17:04:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:07:59] !log re-enabled puppet on an-masters and an-launcher [17:08:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:08:55] Ok, the cluster is back up and running, jobs are running again [17:09:14] Will gather logs on fsimage error and write up in task [17:09:33] * elukey afk! [17:10:10] I will monitor jobs on the cluster razzi - thanks for resetting ! [17:16:14] razzi: dumb question - Have you reset the yarn queue to accept jobs\/ [17:16:25] ? [17:18:57] ok I confirm this has not been done - ping razzi or ottomata :) [17:19:22] Oops! 
Yeah let me do that [17:19:31] thanks razzi :) [17:19:44] Smar question :) [17:19:48] 'smart [17:21:23] PROBLEM - Check unit status of refine_eventlogging_legacy on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:27:36] ^ that is because the yarn queue is not accepting [17:28:08] !log sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues to enable submitting jobs once again [17:28:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:28:46] !log sudo systemctl restart refine_eventlogging_legacy [17:28:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:30:43] Still not working razzi - Have restarted the ResourceManager? [17:31:07] hum - the refreshQueue should have been enough [17:31:23] PROBLEM - Check unit status of refine_event on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:33:08] 10Analytics-Radar, 10SRE, 10ops-eqiad: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10Cmjohnson) @elukey I have this on my plan for tomorrow morning. i'll update the task once the move is complete. [17:34:06] oh, I bet I have to set status to STARTED, just removing STOPPED isn't enough [17:34:15] PROBLEM - Check unit status of refine_eventlogging_analytics on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:41:00] Ah! interesting razzi [17:50:59] Hm, that didn't fix it [17:52:55] !log sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues on an-master1001 and an-master1002 [17:52:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:01:32] !log manually edit /etc/hadoop/conf/capacity-scheduler.xml to make queues running and sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues [18:01:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:04:01] Ok, production queue is accepting again [18:04:12] \o/ [18:04:23] PROBLEM - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:04:29] Thanks razzi - How have you managed to make it work? 
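For context on why the earlier refreshQueues attempts did not help: the capacity scheduler keeps a per-queue state in /etc/hadoop/conf/capacity-scheduler.xml (valid values are RUNNING and STOPPED), so each queue has to be flipped back explicitly before a refresh takes effect. A sketch, with queue names taken from the fix razzi describes next:

```
# One such property per queue in capacity-scheduler.xml, e.g.:
#
#   <property>
#     <name>yarn.scheduler.capacity.root.default.state</name>
#     <value>RUNNING</value>
#   </property>
#
# (likewise for root.fifo, root.production, root.essential), then push it to the ResourceManager:
sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues

# Confirm every queue is back to a running state:
mapred queue -list
```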
[18:04:43] PROBLEM - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:06:15] PROBLEM - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:06:25] RECOVERY - Check unit status of refine_eventlogging_analytics on an-launcher1002 is OK: OK: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:06:39] Had to add a bunch of lines to /etc/hadoop/conf/capacity-scheduler.xml, like [18:06:39] ``` [18:06:39] <property> [18:06:39] <name>yarn.scheduler.capacity.root.fifo.state</name> [18:06:39] <value>RUNNING</value> [18:06:39] </property> [18:06:39] ``` [18:06:40] for fifo,default,production,essential [18:07:03] I may have missed some, wish there was a --recursive or something [18:08:36] ok, based on `mapred queue -list`, that's all of them [18:08:58] man that's uncool [18:09:08] ok, next time we'll stop the default queue only! [18:09:13] razzi: --^ [18:11:57] Ok maintenance is officially over, yarn is accepting, hdfs is writeable, alas an-masters still run stretch [18:12:32] \o/ [18:12:33] PROBLEM - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:14:49] !log sudo systemctl start eventlogging_to_druid_navigationtiming_hourly.service [18:14:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:15:07] RECOVERY - Check unit status of refine_eventlogging_legacy on an-launcher1002 is OK: OK: Status of the systemd unit refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:15:40] nice razzi !!!
:) [18:16:49] !log sudo systemctl start all failed units from `systemctl list-units --state=failed` on an-launcher1002 [18:16:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:16:57] RECOVERY - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:22:59] RECOVERY - Check unit status of drop_event on an-launcher1002 is OK: OK: Status of the systemd unit drop_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:23:13] RECOVERY - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:25:03] RECOVERY - Check unit status of refine_event on an-launcher1002 is OK: OK: Status of the systemd unit refine_event https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:25:50] RECOVERY - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:26:11] RECOVERY - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:30:23] * razzi cheers at the recoveries [18:30:35] * razzi lunchtime [18:38:35] PROBLEM - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:40:05] PROBLEM - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:46:21] PROBLEM - Check unit status of eventlogging_to_druid_navigationtiming_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:48:57] PROBLEM - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:51:59] (03PS1) 10Fdans: Alter routing logic to allow value lists [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/694634 (https://phabricator.wikimedia.org/T283596) [18:53:25] (03PS2) 10Fdans: Alter routing logic to allow value lists [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/694634 (https://phabricator.wikimedia.org/T283596) [19:01:31] RECOVERY - Check unit status of eventlogging_to_druid_editattemptstep_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_editattemptstep_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:07:49] RECOVERY - Check unit status of eventlogging_to_druid_navigationtiming_hourly on 
an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:09:29] 10Analytics, 10Analytics-Kanban: Change state to store project as an array - https://phabricator.wikimedia.org/T283624 (10fdans) [19:10:23] RECOVERY - Check unit status of eventlogging_to_druid_netflow_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:10:43] RECOVERY - Check unit status of eventlogging_to_druid_prefupdate_hourly on an-launcher1002 is OK: OK: Status of the systemd unit eventlogging_to_druid_prefupdate_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:14:09] PROBLEM - Hadoop NodeManager on analytics1068 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:33:13] RECOVERY - Hadoop NodeManager on analytics1068 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:00:07] razzi: am available lemme know if you are back from lunch and ready to meet [20:01:29] ottomata: sounds good, ready to meet in 2 mins [20:01:35] k bc? [20:04:29] yep! [20:04:32] ottomata: [20:04:46] k [20:27:28] I trust: urbanecm!.*@user/urbanecm (2admin), .*@user/urbanecmbackup/x-3733651 (2admin), [20:27:28] @trusted [20:27:44] Successfully added .*@wikimedia/Martin-Urbanec [20:27:44] @trustadd .*@wikimedia/Martin-Urbanec admin [20:31:42] You are admin and identified by the name .*@wikimedia/Martin-Urbanec [20:31:42] @whoami [20:31:58] User was deleted from access list [20:31:58] @trustdel urbanecm!.*@user/urbanecm [20:32:03] I trust: .*@user/urbanecmbackup/x-3733651 (2admin), .*@wikimedia/Martin-Urbanec (2admin), [20:32:03] @trusted [20:50:08] 10Analytics, 10EventStreams, 10Patch-Needs-Improvement, 10Services (watching): EventStreams process occasionally OOMs - https://phabricator.wikimedia.org/T210741 (10Aklapper) [20:56:10] 10Analytics, 10EventStreams, 10Patch-Needs-Improvement, 10Services (watching): EventStreams process occasionally OOMs - https://phabricator.wikimedia.org/T210741 (10Ottomata) 05Open→03Declined [21:07:20] 10Analytics-Radar, 10LDAP-Access-Requests, 10SRE, 10SRE-Access-Requests: Account setup issues for jmixter-ctr - https://phabricator.wikimedia.org/T283250 (10jmixter) yeah sorry about that. I think this was a symptom of me being new and not having any idea what I was doing. I think things are resolved now. [21:10:12] (03PS5) 10Aklapper: Create simple CLI management tool [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/512614 (https://phabricator.wikimedia.org/T224376) (owner: 10Framawiki) [21:21:02] 10Analytics, 10Discovery, 10Event-Platform, 10Product-Data-Infrastructure, and 2 others: Avoid accepting Kafka messages with whacky timestamps - https://phabricator.wikimedia.org/T282887 (10BPirkle) [23:53:49] 10Analytics-Radar, 10LDAP-Access-Requests, 10SRE, 10SRE-Access-Requests: Account setup issues for jmixter-ctr - https://phabricator.wikimedia.org/T283250 (10Dzahn) 05Open→03Resolved a:03Dzahn @jmixter Cool, great to hear that things work for you now and thanks for confirming. 
I think the wiki editin...