[00:03:45] RESOLVED: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[00:38:06] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1106351
[00:38:06] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1106351 (owner: 10TrainBranchBot)
[00:48:19] 10SRE-swift-storage, 06Commons: Unable to restore File:Model 4000-First of Odakyu Electric Railway 2.JPG - https://phabricator.wikimedia.org/T382694#10422363 (10Sreejithk2000) Awesome, thank you guys.
[00:55:35] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, 10Move-Files-To-Commons: Error using FileImporter and undelete file on Commons because of "local-multiwrite/local-public...is in an inconsistent state within the inte... - https://phabricator.wikimedia.org/T382715#10422369
[00:57:44] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1106351 (owner: 10TrainBranchBot)
[01:04:16] 06SRE: Remove leftover DNS from declined chapter wikis causing language Wikipedia to resolve incorrectly on a *.wikimedia.org - https://phabricator.wikimedia.org/T382730 (10Dylsss) 03NEW
[01:07:33] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 54282784 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:08:06] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1106353
[01:08:06] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1106353 (owner: 10TrainBranchBot)
[01:08:33] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:10:03] 06SRE, 10DNS: Remove leftover DNS from declined chapter wikis causing language Wikipedia to resolve incorrectly on a *.wikimedia.org - https://phabricator.wikimedia.org/T382730#10422389 (10Bugreporter)
[01:23:03] PROBLEM - ganeti-wconfd running on ganeti4008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[01:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422395 (10phaultfinder)
[01:26:42] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1106353 (owner: 10TrainBranchBot)
[01:28:25] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:33:11] 10SRE-swift-storage, 06Data-Persistence, 10MediaWiki-Uploading: High amount of 503/504 for swift uploads - https://phabricator.wikimedia.org/T382705#10422451 (10AntiCompositeNumber)
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241224T0300)
[03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:19:50] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422513 (10phaultfinder)
[04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241224T0400)
[04:24:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422556 (10phaultfinder)
[05:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241224T0500)
[05:01:24] !log mwpresync@deploy2002 Pruned MediaWiki: 1.44.0-wmf.5 (duration: 01m 21s)
[05:28:25] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:35:22] (03CR) 10Pppery: zhwiki: Allow local securepoll setup (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: 10Stang)
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241224T0700)
[07:00:05] marostegui, Amir1, and arnaudb: gettimeofday() says it's time for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241224T0700)
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241224T0800)
[08:14:14] (03CR) 10Stang: "per T381197 I think they mean the replicas." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: 10Stang)
[08:24:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422587 (10phaultfinder)
[09:28:25] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:39:15] !log akosiaris@cumin1002 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=kubernetes,service=kubemaster,name=wikikube-ctrl1004.eqiad.wmnet
[09:39:20] !log akosiaris@cumin1002 conftool action : set/weight=10; selector: dc=eqiad,cluster=kubernetes,service=kubemaster,name=wikikube-ctrl1004.eqiad.wmnet
[10:15:34] PROBLEM - MariaDB read only s5 #page on db2123 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.6.17-MariaDB-log, Uptime 105s, event_scheduler: True, 28.48 QPS, connection latency: 0.029987s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:16:03] !incidents
[10:16:03] 5573 (UNACKED) db2123 (paged)/MariaDB read only s5 (paged)
[10:16:03] 5572 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[10:16:03] 5566 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[10:16:04] 5568 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule)
[10:16:04] 5567 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule)
[10:16:04] 5571 (RESOLVED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad)
[10:16:04] 5570 (RESOLVED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad)
[10:16:05] 5569 (RESOLVED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad)
[10:16:05] 5565 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[10:16:06] 5564 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[10:16:06] 5563 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule)
[10:16:07] 5562 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule)
[10:16:07] 5560 (RESOLVED) [2x] ProbeDown sre (wikikube-ctrl1004:6443 probes/custom eqiad)
[10:16:08] 5561 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin)
[10:16:26] master comes back read-only after a crash: is that a "phone a DBA" case?
[10:16:33] !ack 5573
[10:16:34] 5573 (ACKED) db2123 (paged)/MariaDB read only s5 (paged)
[10:16:35] just checking orchestrator
[10:16:41] me checking as well
[10:16:48] going through the wikitech page
[10:17:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:17:15] this is new, I haven't seen it before
[10:17:30] let me check errors on the host
[10:17:59] I'm on a plane, but if it's a master I suggest a switchover
[10:18:26] mariadb has indeed been running for only 4 minutes on that host
[10:18:27] it's RO and that's causing a flood of errors
[10:18:39] yeah, I will do a switchover
[10:18:50] let me set the section to read only in dbctl
[10:19:16] if you want someone else to do something, shout, I'm here
[10:19:39] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2213 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1106557 (https://phabricator.wikimedia.org/T382743)
[10:19:43] (03PS1) 10Gerrit maintenance bot: wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/1106558 (https://phabricator.wikimedia.org/T382743)
[10:19:54] T382743
[10:19:55] T382743: Switchover s5 master (db2123 -> db2213) - https://phabricator.wikimedia.org/T382743
[10:20:29] marostegui: Shall I go forward with this ^?
[10:20:50] Yes, go for the switch
[10:21:02] You'll need: --master-read-only
[10:21:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Set s5 codfw as read-only for maintenance - T382743', diff saved to https://phabricator.wikimedia.org/P71743 and previous config saved to /var/cache/conftool/dbconfig/20241224-102102-ladsgroup.json
[10:21:07] On the switchover script
[10:21:11] ah good point, thanks
[10:21:19] And remember to set read only off on the NEW master once it is switched
[10:21:23] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s5 T382743
[10:21:47] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s5 T382743
[10:21:58] Amir1: ^ the new master will require read only off manually
[10:22:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Set db2213 with weight 0 T382743', diff saved to https://phabricator.wikimedia.org/P71744 and previous config saved to /var/cache/conftool/dbconfig/20241224-102200-ladsgroup.json
[10:22:07] As if you use the option above it won't do it for you
[10:22:14] (which makes sense)
[10:22:28] ah okay
[10:22:43] old master weight 100
[10:23:44] moving replicas
[10:23:54] I'm so happy now this takes 2 minutes
[10:24:03] it used to take half an hour
[10:24:20] progress :)
[10:24:58] https://orchestrator.wikimedia.org/web/cluster/alias/s5
[10:25:05] you can check the progress in the above
[10:26:00] Emperor: mind creating a task for the old master crash so we can track the HW issues etc?
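The read-only handling being discussed above comes down to a single MariaDB global variable: the switchover script's --master-read-only option leaves the new master read-only, so it has to be cleared by hand afterwards. A minimal sketch of the statements involved (host roles taken from the log; this is an illustration, not the actual runbook):

```sql
-- On the NEW primary (db2213), after a switchover run with --master-read-only:
-- the section stays unwritable until read_only is cleared manually.
SET GLOBAL read_only = 0;
SHOW GLOBAL VARIABLES LIKE 'read_only';  -- should now report OFF

-- On the OLD (crashed) primary, db2123, writes stay blocked:
SET GLOBAL read_only = 1;
```

Running `SET GLOBAL read_only = 1` instead of `0` on the new primary is exactly the slip acknowledged later in the log ("I set the read only to 1").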
[10:26:08] will do
[10:26:14] Thanks <3
[10:26:21] I need to keep flying
[10:26:26] Ping me if needed
[10:27:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:28:44] it really had to break on Christmas day, you know
[10:30:09] okay, that is done now
[10:30:28] (03CR) 10Ladsgroup: [C:03+2] mariadb: Promote db2213 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1106557 (https://phabricator.wikimedia.org/T382743) (owner: 10Gerrit maintenance bot)
[10:31:29] Made T382744 and included the mariadb barf-o-gram
[10:31:29] T382744: mysql crash on db2123 - https://phabricator.wikimedia.org/T382744
[10:32:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:33:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Promote db2213 to s5 primary and set section read-write T382743', diff saved to https://phabricator.wikimedia.org/P71745 and previous config saved to /var/cache/conftool/dbconfig/20241224-103304-ladsgroup.json
[10:33:09] T382743: Switchover s5 master (db2123 -> db2213) - https://phabricator.wikimedia.org/T382743
[10:33:33] it should be back-ish now
[10:33:42] users should be able to edit
[10:33:57] can someone try it in dewiki? on a user subpage
[10:34:34] RECOVERY - MariaDB read only s5 #page on db2123 is OK: Version 10.6.17-MariaDB-log, Uptime 1245s, read_only: True, event_scheduler: True, 24.72 QPS, connection latency: 0.030986s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:35:57] trying
[10:36:34] Amir1: still getting a r/o error
[10:36:44] I think I need to manually drop the row in pt-heartbeat
[10:37:48] The system administrator who locked it offered this explanation: The primary database server is running in read-only mode.
[10:38:07] sigh, somehow when I ran it it didn't work
[10:38:41] ah, I am the idiot. I set the read only to 1
[10:38:49] taavi: can you try now?
[10:39:10] https://de.wikipedia.org/w/index.php?title=Benutzer:Taavi/sandbox&oldid=251541310
[10:39:11] works
[10:39:20] awesome
[10:39:22] cool
[10:39:26] now the cleanup of the replicas
[10:39:29] thanks Amir1!
[10:41:33] > Slave_IO_State: Waiting to reconnect after a failed master event read
[10:42:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:46:07] heartbeat is cleaned, so the lag should be okay now
[10:46:35] but replicas constantly decide to wait because they couldn't connect before. I need to flush some settings somewhere
[10:48:03] it's now fixed. Now only the old master is lagging, because it's RO
[10:48:50] (03CR) 10Ladsgroup: [C:03+2] wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/1106558 (https://phabricator.wikimedia.org/T382743) (owner: 10Gerrit maintenance bot)
[10:52:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool db2123 T382743', diff saved to https://phabricator.wikimedia.org/P71746 and previous config saved to /var/cache/conftool/dbconfig/20241224-105203-ladsgroup.json
[10:52:08] T382743: Switchover s5 master (db2123 -> db2213) - https://phabricator.wikimedia.org/T382743
[10:53:31] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 10 days, 0:00:00 on db2123.codfw.wmnet with reason: Broken T382743 T382743
[10:53:34] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on db2123.codfw.wmnet with reason: Broken T382743 T382743
[10:55:56] I restarted replication on the old master
[10:56:05] it's catching up, it seems
[10:57:35] I have to do some family stuff. Things are fine; send me an SMS or call if things are not okay
[10:58:06] don't know why I don't get pages on splunk...
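Two cleanup steps from the conversation above (dropping the stale pt-heartbeat row and getting stuck replicas to reconnect) can be sketched as plain MariaDB statements. This assumes pt-heartbeat's default heartbeat.heartbeat table; the server_id value is a placeholder, and the I/O-thread restart is one plausible way to force the reconnect, not necessarily what was done here:

```sql
-- Replicas compute lag from the freshest heartbeat row, so a stale row
-- written by the crashed old primary makes lag look permanent.
SELECT server_id, ts FROM heartbeat.heartbeat;
DELETE FROM heartbeat.heartbeat WHERE server_id = 12345;  -- old primary's server_id (placeholder)

-- A replica stuck in "Waiting to reconnect after a failed master event read"
-- can be nudged by restarting the replication I/O thread:
STOP SLAVE IO_THREAD;
START SLAVE IO_THREAD;
SHOW SLAVE STATUS\G  -- expect Slave_IO_Running: Yes
```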
[10:58:25] fabfur: for my case: I'm logged out
[10:58:38] * Emperor gets them by SMS
[11:06:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[12:57:12] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:59:17] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:01:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[13:28:25] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:24:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422749 (10phaultfinder)
[14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:38:05] !log T382741 Ran mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=bnwiki --logwiki=metawiki 'Esteban16' 'Renamed user f26394dcb19bd7bdad78f0d752896653'
[14:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:10] T382741: Unblock stuck global rename of Renamed_user_f26394dcb19bd7bdad78f0d752896653 - https://phabricator.wikimedia.org/T382741
[14:50:28] 10SRE-swift-storage: Cannot move File:فندق قصبة بوزنيقة.jpg to File:Kasbah Hotel in Bouznika.jpg on Commons - https://phabricator.wikimedia.org/T382750 (10mdaniels5757) 03NEW
[15:06:18] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2143 - https://phabricator.wikimedia.org/T382751 (10ops-monitoring-bot) 03NEW
[15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:14:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422777 (10phaultfinder)
[15:33:22] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2143 - https://phabricator.wikimedia.org/T382751#10422793 (10Ladsgroup) It's already depooled. It seems this exists too {T354593}
[15:59:47] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422798 (10phaultfinder)
[16:44:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422827 (10phaultfinder)
[17:14:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422854 (10phaultfinder)
[17:28:25] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:49:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422859 (10phaultfinder)
[18:34:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422861 (10phaultfinder)
[18:54:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422873 (10phaultfinder)
[19:24:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422875 (10phaultfinder)
[20:09:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422892 (10phaultfinder)
[21:28:25] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:24:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422931 (10phaultfinder)