[00:31:05] <icinga-wm_>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:16] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:26:16] <jinxer-wm>	 (SystemdUnitFailed) firing: (6) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:30:25] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:40:02] <jinxer-wm>	 (SystemdUnitFailed) firing: (7) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:43:16] <wikibugs>	 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ayounsi)
[07:09:07] <wikibugs>	 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10MoritzMuehlenhoff)
[07:24:11] <wikibugs>	 10Analytics-Radar, 10Infrastructure-Foundations, 10netops: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 (10MoritzMuehlenhoff) >>! In T273026#8733992, @cmooney wrote: > Must be a race condition of some kind I'm guessing but not sure what it might be.  Pro...
[07:24:21] <wikibugs>	 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10Ladsgroup) MW section masters:  - db1100: s5  - db1131: s6  - db1181: s7  Need to downtime the whole sections for these. I'll do it a b...
[07:34:28] <aqu>	 !log Rerun refine_event with "sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_event --ignore_failure_flag=true --table_include_regex='mediawiki_visual_editor_feature_use|mediawiki_edit_attempt|mediawiki_web_ui_interactions' --since='2023-04-02T18:00:00.000Z' --until='2023-04-03T19:00:00.000Z'"
[07:34:29] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:01:17] <jinxer-wm>	 (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:05:02] <jinxer-wm>	 (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:13:47] <wikibugs>	 10Data-Engineering, 10serviceops-radar, 10Event-Platform Value Stream (Sprint 11): Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (10dcausse) @Ottomata yes the job running with the flink-operator on the dse is using checkpoints, it can be used to experiment with zookeeper, w...
[08:25:02] <jinxer-wm>	 (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:35:02] <jinxer-wm>	 (SystemdUnitFailed) firing: (9) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:51:31] <wikibugs>	 10Data-Engineering, 10DBA, 10Data-Services: Prepare and check storage layer for ckbwiktionary - https://phabricator.wikimedia.org/T331834 (10Ladsgroup) a:05Ladsgroup→03None I created the database and gave the rights to labsdbuser, it's now data engineering's turn to run their scripts.
[09:32:48] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Wikistats: Monthly pageview stats for March 2023 missing - https://phabricator.wikimedia.org/T333923 (10Radim.kubacki)
[09:46:40] <wikibugs>	 (03PS1) 10Barakat Ajadi: Navtiming: Add longtask task and longtask duration before FCP [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/905589 (https://phabricator.wikimedia.org/T327477)
[09:51:46] <wikibugs>	 (03CR) 10Phuedx: [C: 03+1] sanitization: Remove some NavigationTiming retentions [analytics/refinery] - 10https://gerrit.wikimedia.org/r/904660 (owner: 10Krinkle)
[09:52:37] <wikibugs>	 10Data-Engineering, 10serviceops-radar, 10Event-Platform Value Stream (Sprint 11): Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (10elukey) Zookeeper is probably going to be supported for a long time at the WMF, it is mostly Kafka related but migrating away from it means:...
[10:14:54] <wikibugs>	 (03CR) 10Phedenskog: [C: 03+2] "Looks good!" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/905589 (https://phabricator.wikimedia.org/T327477) (owner: 10Barakat Ajadi)
[10:15:47] <wikibugs>	 (03Merged) 10jenkins-bot: Navtiming: Add longtask task and longtask duration before FCP [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/905589 (https://phabricator.wikimedia.org/T327477) (owner: 10Barakat Ajadi)
[10:20:31] <elukey>	 steve_munene: o/
[10:20:56] <elukey>	 do you need any review/support/etc.. for today's row C maintenance?
[10:32:28] <elukey>	 afaics from https://phabricator.wikimedia.org/T331882 there is quite a bit of work to do, starting in an hour
[10:33:41] <elukey>	 joal: around by any chance?
[10:41:10] <elukey>	 Sent the email to analytics-announce for the matomo/superset/turnilo downtime
[10:41:56] <wikibugs>	 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10elukey)
[10:45:49] <elukey>	 created https://gerrit.wikimedia.org/r/c/operations/puppet/+/905596 for yarn and gobblin
[10:46:22] <elukey>	 DE folks - is there anybody that can assist me when I stop yarn queues and gobblin timers?
[10:46:30] <elukey>	 (going out for a quick lunch, back in a bit)
[10:48:44] <steve_munene>	 hi elukey, yes I do. Sending out a notification for hadoop and yarn
[10:49:23] <elukey>	 steve_munene: ack perfect, I sent one for Turnilo/Superset/Matomo
[10:49:44] <elukey>	 the code review is also out; I didn't see you online so I went ahead and created one
[10:50:02] <jinxer-wm>	 (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:50:13] <elukey>	 going out for lunch, will be back in a bit. Lemme know if you need help from me :)
[11:05:02] <jinxer-wm>	 (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:25:45] <jynus>	 a-team
[11:27:17] <elukey>	 jynus: o/ I think that Steve is out for lunch, anything that you need?
[11:27:53] <jynus>	 yes, we need someone with pageviews general knowledge
[11:28:28] <jynus>	 it's an ongoing incident - not an emergency, but relatively time sensitive
[11:29:34] <jynus>	 I can maybe ping Steve after he comes back
[11:30:22] <elukey>	 but is it related to the infra or to the Pageview content? 
[11:30:27] <elukey>	 I can try to help if needed
[11:30:30] <jynus>	 both
[11:30:50] <jynus>	 you can read the backlog on a channel you are in
[11:32:30] <jynus>	 search for "data engineering" mention, but we may need someone with specific pageviews knowledge
[11:32:43] <elukey>	 yep I think so
[11:33:05] <jynus>	 elukey: do you know at least if they have a general contact email?
[11:33:48] <elukey>	 there is a public one IIRC, not sure about any internal ones, probably they all use slack
[11:33:59] <elukey>	 I'd suggest following up with either mforns or milimetric
[11:34:08] <jynus>	 yep, I tried 
[11:34:59] <jynus>	 will try a bit later
[11:35:05] <jynus>	 thank you, elukey
[11:40:58] <milimetric>	 hi jynus, what channel?
[11:41:26] <jynus>	 milimetric: I PM you
[11:41:52] <milimetric>	 (btw our team email is data-engineering-team)
[11:42:47] <elukey>	 !log stop puppet on an-launcher1002 and manually stop .timer units
[11:42:48] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:44:23] <elukey>	 steve_munene: not sure what the procedure is, but to avoid stopping ongoing jobs on an-launcher1002 I just stopped the relevant .timer systemd units
[11:44:32] <elukey>	 and disabled puppet of course
[11:44:55] <elukey>	 so the .service units, if any were running, would keep going, but they wouldn't be rescheduled
[11:45:17] <elukey>	 (it is an alternative to "Absent" all timers)
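(A minimal sketch of the timer-only stop described above, on an-launcher1002; the disable-puppet wrapper and the timer name are assumptions, the real list comes from `systemctl list-timers`:)

  # keep any already-running .service units going, just stop them being rescheduled
  sudo disable-puppet "eqiad row C maintenance - T331882"    # assumed wrapper; keeps puppet from restarting the timers
  systemctl list-timers                                      # see which .timer units exist
  sudo systemctl stop gobblin-event_default.timer            # hypothetical timer name, repeat for each relevant timer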
[11:46:30] <wikibugs>	 10Data-Engineering-Planning, 10SRE-swift-storage, 10Event-Platform Value Stream (Sprint 11): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10gmodena) >>! In T330693#8701662, @gmodena wrote: [...] >> @MatthewVernon brou...
[11:47:11] <elukey>	 updated the code review just to stop Yarn queues
[11:48:28] <elukey>	 lemme know when you are back so we can stop yarn
[11:49:04] <steve_munene>	 Were there multiple running jobs?
[11:49:09] <steve_munene>	 I am back
[11:49:35] <elukey>	 I didn't check yarn yet
[11:49:45] <elukey>	 or do you mean on an-launcher?
[11:49:45] <jynus>	 steve_munene: I contacted mili with all the info, please coordinate to see who is the right person to help us
[11:51:01] <steve_munene>	 I meant on an-launcher, that would require us to avoid stopping the timers
[11:51:11] <steve_munene>	 thanks jynus, reaching out
[11:51:58] <elukey>	 steve_munene: so the .timer units (the config that systemd uses to periodically schedule .service units, like gobblin etc..) can be seen via `systemctl list-timers`
[11:51:59] <jynus>	 sorry for the urgency - it is not "wikis are down" levels of emergency, but I thought it was important to ask for your help, thank you
[11:52:35] <elukey>	 steve_munene: if you stop (manually via `systemctl stop blabla.timer`) only the timer unit, the .service one will keep going, but it will not be rescheduled
[11:52:51] <elukey>	 with the puppet "absent" way, we remove all configs for all the timers from the node
[11:52:56] <elukey>	 that is a bit more brutal
[11:53:22] <elukey>	 this is why I opted for simply stopping the relevant .timer jobs manually, less invasive (at least, that's how I used to do it)
[11:53:41] <elukey>	 for Yarn we'll need to deploy the patch and call the special refresh queue command
[11:54:05] <elukey>	 and I think we are about on time, the task suggests doing it half an hour before maintenance starts
[11:54:17] <steve_munene>	 thanks for the explanation.
[11:55:26] <steve_munene>	 Sure, good timing. Hop on a call?
[11:56:13] <elukey>	 if you don't mind let's sync in here, I am finishing one thing and I'd need to prep for the maintenance as well :)
[11:57:44] <elukey>	 so the maintenance is in ~ 1 hour
[11:57:58] <elukey>	 and I am reading the DE section of https://phabricator.wikimedia.org/T331882
[11:58:08] <steve_munene>	 That’s cool
[11:58:17] <elukey>	 stopping the Yarn queues + HDFS safe mode can be done in a bit (30 mins before)
[11:58:26] <elukey>	 so we can focus on the depool actions
[11:58:47] <elukey>	 do you know how to depool a node? Otherwise I'll give you some info
[11:59:40] <steve_munene>	 Haven’t done one yet
[12:00:28] <elukey>	 ack so there are two ways
[12:00:43] <elukey>	 1) you ssh on the node, and execute `sudo -i depool`
[12:00:54] <elukey>	 2) you use conftool from puppetmaster1001 (see https://wikitech.wikimedia.org/wiki/Conftool)
[12:01:37] <elukey>	 the good thing about 2) is that you get a log entry in the SRE's irc SAL automatically (https://sal.toolforge.org/production)
[12:01:50] <elukey>	 but 1) is fine as well, especially if you haven't done 2) before
[12:02:25] <elukey>	 and then, after the depool, you can check whether the backend is pooled or not at https://config-master.wikimedia.org/pybal/eqiad
[12:02:34] <elukey>	 (there are dedicated pages for every service)
[12:04:07] <elukey>	 and in our case, we need to depool some nodes in "aqs" and some nodes for "datahub"
[12:04:23] <elukey>	 (and after the maintenance, we need to repool them)
[12:07:42] <elukey>	 choose the path that you prefer, I can give you any info :)
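(Option 1 above, spelled out; the `pool` counterpart to `depool` is assumed to exist, and aqs1010 is just an example host from this maintenance:)

  # option 1: from the node itself
  sudo -i depool    # take this backend out of its load balancer pools
  # ...maintenance...
  sudo -i pool      # assumed counterpart, repools the backend afterwards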
[12:07:57] <elukey>	 after that, we'll start the procedure for Yarn 
[12:08:29] <steve_munene>	 Thanks, checking on 2 to get the right syntax.
[12:09:57] <wikibugs>	 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ssingh)
[12:12:13] <steve_munene>	 Going with 2, should I get started on datahubsearch?
[12:13:16] <wikibugs>	 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10aborrero)
[12:16:17] <elukey>	 steve_munene: let's sync on the command to execute first, can you paste it in here?
[12:17:00] <elukey>	 (conftool is very powerful and the first times it is best to double check to avoid depooling too many things by mistake etc..)
[12:18:21] <steve_munene>	 sure, from puppetmaster1001 confctl depool --hostname datahubsearch1003.eqiad.wmnet
[12:19:40] <elukey>	 ack, remember to use sudo -i in front
[12:21:37] <steve_munene>	 Ack, same for the aqs servers?
[12:21:43] <elukey>	 yep exactly
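(Option 2, based on the confctl command pasted above; the `pool` action for repooling is assumed to be symmetric, and the hostnames are the ones discussed here:)

  # option 2: from puppetmaster1001 (gets logged to the SRE SAL automatically)
  sudo -i confctl depool --hostname datahubsearch1003.eqiad.wmnet
  sudo -i confctl depool --hostname aqs1010.eqiad.wmnet           # repeat for the other row C aqs hosts
  # after the maintenance
  sudo -i confctl pool --hostname datahubsearch1003.eqiad.wmnet
  # pooled state per service is visible at https://config-master.wikimedia.org/pybal/eqiad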
[12:23:58] <steve_munene>	 cool getting started
[12:24:25] <elukey>	 ack, we are about on time to stop yarn queues 
[12:29:16] <steve_munene>	 Confirmed depool
[12:30:04] <elukey>	 nice :)
[12:30:11] <elukey>	 next step is yarn
[12:30:22] <elukey>	 so IIRC the procedure is the following:
[12:30:25] <elukey>	 1) merge  the puppet change
[12:30:42] <elukey>	 2) run puppet on an-master100[12], so that the yarn config gets updated
[12:30:44] <wikibugs>	 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10Stevemunene)
[12:30:51] <elukey>	 3) run the command to refresh the queues
[12:31:44] <steve_munene>	 not familiar with the refresh queues
[12:32:14] <elukey>	 I was looking into https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Hadoop/Administration but I didn't find it
[12:32:39] <elukey>	 so you can restart the yarn resource managers on an-master100[12], or just run the refreshQueue command on a single node (it will reload the queue config)
[12:32:42] <elukey>	 lemme find it
[12:33:09] <wikibugs>	 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10aborrero)
[12:33:27] <elukey>	 should be something like `sudo kerberos-run-command yarn /usr/bin/yarn rmadmin -refreshQueues`
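(The three steps above, condensed; run-puppet-agent is assumed to be the usual puppet wrapper on an-master100[12], and the refreshQueues call is the one quoted above:)

  # 1) merge the puppet change that stops the Yarn queues
  # 2) on an-master1001 and an-master1002, pick up the new queue config
  sudo run-puppet-agent            # or: sudo puppet agent --test
  # 3) on one master, make the ResourceManager reload the queue config
  sudo kerberos-run-command yarn /usr/bin/yarn rmadmin -refreshQueues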
[12:34:43] <wikibugs>	 (03PS3) 10Lgaulia: Add first input delay schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/902693 (https://phabricator.wikimedia.org/T332012)
[12:34:55] <steve_munene>	 ack getting started
[12:34:58] <wikibugs>	 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10aborrero)
[12:37:41] <wikibugs>	 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10aborrero)
[12:39:08] <wikibugs>	 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10MatthewVernon)
[12:39:48] <wikibugs>	 10Data-Engineering, 10Machine-Learning-Team, 10Research, 10Event-Platform Value Stream (Sprint 11): Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (10achou) > We could def put them in the same event stream, as long as they share the same...
[12:41:27] <wikibugs>	 (03PS4) 10Lgaulia: Add first input delay schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/902693 (https://phabricator.wikimedia.org/T332012)
[12:42:01] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add first input delay schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/902693 (https://phabricator.wikimedia.org/T332012) (owner: 10Lgaulia)
[12:44:14] <wikibugs>	 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10MatthewVernon)
[12:44:49] <steve_munene>	 Done with the yarn queues and refreshed them, waiting to put HDFS into safe mode in a few
[12:44:59] <elukey>	 ack, do you know how?
[12:45:13] <wikibugs>	 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad...
[12:45:27] <wikibugs>	 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10aborrero)
[12:45:45] <elukey>	 (if you want to check how the queues are doing, you can inspect https://yarn.wikimedia.org/cluster/scheduler?openQueues=Queue:%20root#Queue:%20root#Queue:%20default)
[12:46:31] <elukey>	 steve_munene: I am not sure what the current jobs in running state are doing, they could well be idle spark sessions
[12:46:54] <elukey>	 we can avoid killing them, but when you enter safemode those jobs will start to fail (if they are writing to hdfs)
[12:47:03] <wikibugs>	 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10aborrero)
[12:47:55] <elukey>	 dcausse: o/ we are about to enter HDFS safemode, the Flink job may not like it - https://yarn.wikimedia.org/cluster/app/application_1678266962370_104769
[12:48:15] <dcausse>	 elukey: thanks, stopping it
[12:48:41] <elukey>	 thanks :)
[12:49:07] <elukey>	 steve_munene: and once you are done, you can write in #wikimedia-sre that the DE part is good (and update the task's description as well)
[12:51:53] <steve_munene>	 cool, thanks elukey
[12:52:29] <wikibugs>	 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10aborrero)
[12:57:57] <steve_munene>	 !log putting hdfs into safe mode as part of T331882
[12:58:00] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:58:01] <stashbot>	 T331882: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882
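(Entering safe mode is a plain dfsadmin call; a sketch assuming it runs as the hdfs user on the active NameNode, an-master1001 being an assumption:)

  sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
  sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode get    # should report "Safe mode is ON"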
[12:59:15] <wikibugs>	 10Data-Engineering-Planning, 10Data-Engineering-Wikistats, 10Data Pipelines (Sprint 11): Monthly pageview stats for March 2023 missing - https://phabricator.wikimedia.org/T333923 (10JArguello-WMF)
[12:59:34] <wikibugs>	 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10Stevemunene)
[13:00:02] <wikibugs>	 10Data-Engineering-Planning, 10Data-Engineering-Wikistats, 10Data Pipelines (Sprint 11): Monthly pageview stats for March 2023 missing - https://phabricator.wikimedia.org/T333923 (10JArguello-WMF) a:03Antoine_Quhen
[13:00:44] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Wikistats: Monthly pageview stats for March 2023 missing - https://phabricator.wikimedia.org/T333923 (10JArguello-WMF)
[13:01:19] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Wikistats: Monthly pageview stats for March 2023 missing - https://phabricator.wikimedia.org/T333923 (10JArguello-WMF) a:05Antoine_Quhen→03None
[13:02:11] <elukey>	 steve_munene: nice!
[13:02:20] <elukey>	 so to rollback when everything is done:
[13:02:42] <steve_munene>	 sure, I shall reach out
[13:02:57] <elukey>	 - ssh to an-launcher, re-enable puppet and run it (should be sufficient to restore the state).
[13:03:13] <wikibugs>	 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=80a32cef-9700-4047-8185-415ffca1aaa2) set by ayounsi@cumin1001 for 2:0...
[13:03:16] <elukey>	 err, before that, remove safe mode
[13:03:41] <elukey>	 then revert the yarn queue patch and refresh its queues
[13:03:51] <elukey>	 and finally, repool all nodes via conftool
[13:04:02] <elukey>	 I'll be available if needed!
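(The rollback above condensed into commands; enable-puppet and the conftool pool action are assumed to mirror their disable/depool counterparts, and the hostnames are the ones from this maintenance:)

  # 1) leave HDFS safe mode (active NameNode)
  sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave
  # 2) on an-launcher1002: re-enable puppet and run it, restoring the stopped .timer units
  sudo enable-puppet "eqiad row C maintenance - T331882"
  sudo run-puppet-agent
  # 3) revert the Yarn queue patch, run puppet on an-master100[12], then reload the queues
  sudo kerberos-run-command yarn /usr/bin/yarn rmadmin -refreshQueues
  # 4) repool the depooled backends from puppetmaster1001
  sudo -i confctl pool --hostname datahubsearch1003.eqiad.wmnet    # repeat per aqs/datahub host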
[13:04:43] <steve_munene>	 ack thanks.
[13:06:00] <wikibugs>	 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad...
[13:15:33] <wikibugs>	 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10hnowlan)
[13:15:50] <icinga-wm_>	 PROBLEM - aqs endpoints health on aqs1017 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:15:52] <icinga-wm_>	 PROBLEM - aqs endpoints health on aqs1019 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:15:54] <icinga-wm_>	 PROBLEM - aqs endpoints health on aqs1020 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:15:54] <icinga-wm_>	 PROBLEM - aqs endpoints health on aqs1010 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:15:58] <icinga-wm_>	 PROBLEM - aqs endpoints health on aqs1021 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:16:18] <icinga-wm_>	 PROBLEM - aqs endpoints health on aqs1011 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:16:48] <icinga-wm_>	 PROBLEM - aqs endpoints health on aqs1016 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:16:48] <icinga-wm_>	 PROBLEM - aqs endpoints health on aqs1014 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:17:18] <icinga-wm_>	 PROBLEM - aqs endpoints health on aqs1015 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) is CRITICAL: Test Get top files by mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:19:35] <jinxer-wm>	 (GobblinLastSuccessfulRunTooLongAgo) firing: Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&var-gobblin_job_name=event_default - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[13:20:33] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 11): Event Driven Data Pipelines should be generated from a template - https://phabricator.wikimedia.org/T324980 (10Ottomata)
[13:20:49] <elukey>	 all the above alerts are related to the network maintenance
[13:23:45] <jinxer-wm>	 (SystemdUnitFailed) resolved: (7) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:23:56] <icinga-wm_>	 RECOVERY - aqs endpoints health on aqs1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:23:57] <icinga-wm_>	 RECOVERY - aqs endpoints health on aqs1014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:24:28] <icinga-wm_>	 RECOVERY - aqs endpoints health on aqs1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:24:50] <icinga-wm_>	 RECOVERY - aqs endpoints health on aqs1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:24:50] <icinga-wm_>	 RECOVERY - aqs endpoints health on aqs1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:24:52] <icinga-wm_>	 RECOVERY - aqs endpoints health on aqs1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:24:53] <icinga-wm_>	 RECOVERY - aqs endpoints health on aqs1010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:24:58] <icinga-wm_>	 RECOVERY - aqs endpoints health on aqs1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:25:14] <icinga-wm_>	 RECOVERY - aqs endpoints health on aqs1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:25:29] <jinxer-wm>	 (SystemdUnitFailed) firing: jupyter-appledora-singleuser.service Failed on stat1008:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:25:45] <jinxer-wm>	 (GobblinLastSuccessfulRunTooLongAgo) firing: (2) Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin  - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[13:27:00] <jinxer-wm>	 (SystemdUnitFailed) firing: (7) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:29:51] <jinxer-wm>	 (HdfsMissingBlocks) firing: HDFS missing blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_missing_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=40&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsMissingBlocks
[13:33:06] <icinga-wm_>	 ACKNOWLEDGEMENT - MegaRAID on an-worker1132 is CRITICAL: CRITICAL: 6 failed LD(s) (Offline, Offline, Offline, Offline, Offline, Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T333960 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:33:25] <steve_munene>	 elukey: getting started on the reverse
[13:34:51] <jinxer-wm>	 (HdfsMissingBlocks) resolved: HDFS missing blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_missing_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=40&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsMissingBlocks
[13:35:14] <wikibugs>	 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 11): Event Driven Data Pipelines should be generated from a template - https://phabricator.wikimedia.org/T324980 (10Ottomata)
[13:36:38] <wikibugs>	 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ayounsi) 05Open→03Resolved a:03ayounsi Closing the task as the upgrade is done.  It went extremely smoothly, thank you everybody!...
[13:39:03] <steve_munene>	 !log leave hdfs safemode T331882
[13:39:06] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:39:06] <stashbot>	 T331882: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882
[13:48:58] <wikibugs>	 10Data-Engineering, 10Machine-Learning-Team, 10Research, 10Event-Platform Value Stream (Sprint 11): Design event schema for ML scores/recommendations on current page state - https://phabricator.wikimedia.org/T331401 (10Ottomata) Okay, so it sounds like we are back to our preferred choice: one prediction pe...
[13:49:50] <jinxer-wm>	 (GobblinLastSuccessfulRunTooLongAgo) firing: (4) Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin  - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[14:02:15] <wikibugs>	 10Data-Engineering, 10SRE, 10ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333960 (10Peachey88)
[14:07:30] <aqu>	 Hello steve_munene, are those 2 types of alerts ("Last successful gobblin run" and "HDFS missing blocks") temporary problems generated by the switch upgrade, or should we investigate?
[14:17:18] <steve_munene>	 Hi aqu, they seem to have recovered, but yes they were due to the switch upgrade
[14:21:45] <wikibugs>	 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10Stevemunene)
[14:43:40] <wikibugs>	 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ops-monitoring-bot) jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: eqiad row C...
[14:44:50] <jinxer-wm>	 (GobblinLastSuccessfulRunTooLongAgo) firing: (4) Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin  - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[14:57:42] <icinga-wm_>	 PROBLEM - Hadoop NodeManager on an-worker1132 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:58:16] <icinga-wm_>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: apt-daily-upgrade.service,apt-daily.service,clean_puppet_client_bucket.service,confd_prometheus_metrics.service,export_smart_data_dump.service,hadoop-hdfs-datanode.service,hadoop-yarn-nodemanager.service,ipmiseld.service,lldpd.service,logrotate.service,man-db.service,prometheus-debian-version-textfile.service,prometheus-ipmi-exporter.service,prometheus-nic-firmware-textfile.service,prometheus-node-exporter-apt.service,prometheus-node-exporter.service,prometheus_intel_microcode.service,prometheus_puppet_agent_stats.service,rsyslog.service,syslog.socket,systemd-journald-audit.socket,systemd-journald-dev-log.socket,systemd-journald.service,systemd-journald.socket,systemd-timedated.service,systemd-tmpfiles-clean.service,user-runtime-dir@116.service,wmf_auto_restart_cron.service,wmf_auto_restart_exim4.service,wmf_auto_restart_lldpd.service,wmf_auto_restart_nagios-nrpe-server.service,wmf_auto_restart_nic-saturation-exporter.service,wmf_auto_restart_prometheus-ipmi-e https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:58:26] <icinga-wm_>	 PROBLEM - puppet last run on an-worker1132 is CRITICAL: CRITICAL: Puppet last ran 9 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:59:10] <icinga-wm_>	 PROBLEM - Hadoop DataNode on an-worker1132 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[15:00:23] <wikibugs>	 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ops-monitoring-bot) jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: eqiad row C...
[15:10:02] <jinxer-wm>	 (GobblinLastSuccessfulRunTooLongAgo) firing: (3) Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin  - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[15:19:50] <jinxer-wm>	 (GobblinLastSuccessfulRunTooLongAgo) resolved: (3) Last successful gobblin run of job event_default was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin  - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo
[15:21:30] <elukey>	 steve_munene: one nit - do you mind downtiming an-worker1132 for some days?
[15:24:54] <wikibugs>	 10Data-Engineering-Planning, 10SRE-swift-storage, 10Event-Platform Value Stream (Sprint 11): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Ottomata) Here are the [[ https://docs.google.com/document/d/1T9vcUvbyWSDOFlj...
[15:29:17] <wikibugs>	 10Data-Engineering, 10Event-Platform Value Stream, 10EventStreams, 10Patch-For-Review: Include image/file changes in page-links-change - https://phabricator.wikimedia.org/T333497 (10Ottomata) Cool, thanks for the patch.  Let's involve some other users of this stream in a discussion before we decided to do...
[15:42:50] <steve_munene>	 On it elukey, I agree it is quite noisy.
[15:43:53] <steve_munene>	 elukey: do you think we can safely say the services are up with no major issues after the maintenance?
[15:53:42] <steve_munene>	 Would you recommend we exclude it from HDFS and yarn, as was done here: https://phabricator.wikimedia.org/T330979
[16:01:48] <elukey>	 steve_munene: ah interesting! I thought that the node wasn't in service
[16:02:01] <elukey>	 re: services - yes all good I think! 
[16:03:52] <elukey>	 ah wow I see in the tty (from mgmt console):
[16:03:53] <elukey>	 print_req_error: I/O error, dev sda, sector 109836976
[16:04:03] <elukey>	 so something is broken on an-worker1132 
[16:04:45] <elukey>	 trying to powercycle it
[16:05:21] <steve_munene>	 It was brought back up a short while ago, there’s also a ticket raised with Dell
[16:05:54] <elukey>	 ahhh okok https://phabricator.wikimedia.org/T333960
[16:06:37] <elukey>	 wow LD down
[16:07:04] <icinga-wm_>	 PROBLEM - Host an-worker1132 is DOWN: PING CRITICAL - Packet loss = 100%
[16:08:16] <elukey>	 okok so the node is completely down, we can probably exclude it from hdfs
[16:08:20] <elukey>	 until it is up and running
[16:08:29] <elukey>	 we can sync tomorrow about it if you want
[16:08:46] <steve_munene>	 pasted the wrong link, here is the ticket with the LD details: https://phabricator.wikimedia.org/T333091
[16:09:59] <steve_munene>	 ack, sending a patch to exclude/put it on standby sometime today
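(A sketch of the downtime elukey asked for, assuming the usual sre.hosts.downtime cookbook flags; duration and reason are illustrative:)

  # from a cumin host
  sudo cookbook sre.hosts.downtime --days 3 -r "broken disks, waiting on Dell - T333091" 'an-worker1132.eqiad.wmnet'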
[16:19:12] <jinxer-wm>	 (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:34:12] <jinxer-wm>	 (SystemdUnitFailed) firing: (8) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:45:12] <wikibugs>	 10Quarry, 10Cloud-VPS (Project-requests): Superset project - https://phabricator.wikimedia.org/T333986 (10rook)
[16:46:00] <wikibugs>	 10Quarry, 10Cloud-VPS (Project-requests): Superset project - https://phabricator.wikimedia.org/T333986 (10fnegri) +1
[16:56:36] <wikibugs>	 10Quarry, 10cloud-services-team (FY2022/2023-Q3): Consider moving Quarry to be an installation of Redash - https://phabricator.wikimedia.org/T169452 (10nskaggs)
[16:56:40] <wikibugs>	 10Quarry, 10Cloud-VPS (Project-requests): Superset project - https://phabricator.wikimedia.org/T333986 (10nskaggs)
[16:57:20] <wikibugs>	 10Quarry, 10Cloud-VPS (Project-requests): Superset project - https://phabricator.wikimedia.org/T333986 (10nskaggs) +1
[17:20:22] <wikibugs>	 10Quarry, 10Cloud-VPS (Project-requests): Superset project - https://phabricator.wikimedia.org/T333986 (10rook) ` openstack project create --description 'superset' superset --domain default openstack role add --project superset --user rook member openstack role add --project superset --user rook reader `
[17:20:28] <wikibugs>	 10Quarry, 10cloud-services-team (FY2022/2023-Q3): Consider moving Quarry to be an installation of Redash - https://phabricator.wikimedia.org/T169452 (10rook)
[17:21:04] <wikibugs>	 10Quarry, 10Cloud-VPS (Project-requests): Superset project - https://phabricator.wikimedia.org/T333986 (10rook) 05Open→03Resolved a:03rook
[17:53:26] <wikibugs>	 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10Jelto)
[18:37:18] <wikibugs>	 (03PS1) 10Aqu: Use a disallow list to filter top articles sent to Cassandra [analytics/refinery] - 10https://gerrit.wikimedia.org/r/905701 (https://phabricator.wikimedia.org/T333940)
[18:38:06] <wikibugs>	 10Data-Engineering-Planning, 10SRE-swift-storage, 10Event-Platform Value Stream (Sprint 11): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Ottomata) Answering some specific questions from Eric:  > Will disparate WMF...
[19:50:36] <wikibugs>	 10Data-Engineering, 10DBA, 10Infrastructure-Foundations, 10Machine-Learning-Team, and 8 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10colewhite)
[20:34:40] <jinxer-wm>	 (SystemdUnitFailed) firing: (7) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:00:36] <icinga-wm_>	 PROBLEM - Webrequests Varnishkafka log producer on cp3060 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[22:00:38] <icinga-wm_>	 PROBLEM - eventlogging Varnishkafka log producer on cp3064 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[22:01:43] <icinga-wm_>	 RECOVERY - eventlogging Varnishkafka log producer on cp3064 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[22:08:24] <icinga-wm_>	 RECOVERY - Webrequests Varnishkafka log producer on cp3060 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[22:45:38] <wikibugs>	 10Data-Engineering, 10SRE, 10ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333960 (10Jclark-ctr) Open ticket  with dell Confirmed: Service Request 165628610 was successfully submitted.
[22:47:20] <wikibugs>	 10Data-Engineering, 10SRE, 10ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333960 (10Jclark-ctr) 05Open→03Resolved T333091 duplicate ticket
[22:48:10] <wikibugs>	 10Data-Engineering, 10SRE, 10ops-eqiad: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10Jclark-ctr) Submitted 2nd ticket Open ticket with dell Confirmed: Service Request 165628610 was successfully submitted.   They have not responded to 1st ticket except for asking for address a...