[00:03:45] RESOLVED: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[00:38:06] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1106351
[00:38:06] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1106351 (owner: 10TrainBranchBot)
[00:48:19] 10SRE-swift-storage, 06Commons: Unable to restore File:Model 4000-First of Odakyu Electric Railway 2.JPG - https://phabricator.wikimedia.org/T382694#10422363 (10Sreejithk2000) Awesome, thank you guys.
[00:55:35] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, 10Move-Files-To-Commons: Error using FileImporter and undelete file on Commons because of "local-multiwrite/local-public...is in an inconsistent state within the inte... - https://phabricator.wikimedia.org/T382715#10422369
[00:57:44] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1106351 (owner: 10TrainBranchBot)
[01:04:16] 06SRE: Remove leftover DNS from declined chapter wikis causing language Wikipedia to resolve incorrectly on a *.wikimedia.org - https://phabricator.wikimedia.org/T382730 (10Dylsss) 03NEW
[01:07:33] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 54282784 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:08:06] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1106353
[01:08:06] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1106353 (owner: 10TrainBranchBot)
[01:08:33] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:10:03] 06SRE, 10DNS: Remove leftover DNS from declined chapter wikis causing language Wikipedia to resolve incorrectly on a *.wikimedia.org - https://phabricator.wikimedia.org/T382730#10422389 (10Bugreporter)
[01:23:03] PROBLEM - ganeti-wconfd running on ganeti4008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[01:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422395 (10phaultfinder)
[01:26:42] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1106353 (owner: 10TrainBranchBot)
[01:28:25] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:33:11] 10SRE-swift-storage, 06Data-Persistence, 10MediaWiki-Uploading: High amount of 503/504 for swift uploads - https://phabricator.wikimedia.org/T382705#10422451 (10AntiCompositeNumber)
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241224T0300)
[03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:19:50] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422513 (10phaultfinder)
[04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241224T0400)
[04:24:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422556 (10phaultfinder)
[05:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241224T0500)
[05:01:24] !log mwpresync@deploy2002 Pruned MediaWiki: 1.44.0-wmf.5 (duration: 01m 21s)
[05:28:25] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:35:22] (03CR) 10Pppery: zhwiki: Allow local securepoll setup (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: 10Stang)
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241224T0700)
[07:00:05] marostegui, Amir1, and arnaudb: gettimeofday() says it's time for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241224T0700)
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241224T0800)
[08:14:14] (03CR) 10Stang: "per T381197 I think they mean the replicas." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: 10Stang)
[08:24:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422587 (10phaultfinder)
[09:28:25] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:39:15] !log akosiaris@cumin1002 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=kubernetes,service=kubemaster,name=wikikube-ctrl1004.eqiad.wmnet
[09:39:20] !log akosiaris@cumin1002 conftool action : set/weight=10; selector: dc=eqiad,cluster=kubernetes,service=kubemaster,name=wikikube-ctrl1004.eqiad.wmnet
[10:15:34] PROBLEM - MariaDB read only s5 #page on db2123 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.6.17-MariaDB-log, Uptime 105s, event_scheduler: True, 28.48 QPS, connection latency: 0.029987s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:16:03] !incidents
[10:16:03] 5573 (UNACKED) db2123 (paged)/MariaDB read only s5 (paged)
[10:16:03] 5572 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[10:16:03] 5566 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[10:16:04] 5568 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule)
[10:16:04] 5567 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule)
[10:16:04] 5571 (RESOLVED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad)
[10:16:04] 5570 (RESOLVED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad)
[10:16:05] 5569 (RESOLVED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad)
[10:16:05] 5565 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[10:16:06] 5564 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet esams)
[10:16:06] 5563 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule)
[10:16:07] 5562 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule)
[10:16:07] 5560 (RESOLVED) [2x] ProbeDown sre (wikikube-ctrl1004:6443 probes/custom eqiad)
[10:16:08] 5561 (RESOLVED) ATSBackendErrorsHigh cache_upload sre (swift.discovery.wmnet eqsin)
[10:16:26] master comes back read-only after a crash: is that a "phone a DBA" case?
[10:16:33] !ack 5573
[10:16:34] 5573 (ACKED) db2123 (paged)/MariaDB read only s5 (paged)
[10:16:35] just checking orchestrator
[10:16:41] me checking as well
[10:16:48] going through the wikitech page
[10:17:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:17:15] this is new, I haven't seen it before
[10:17:30] let me check errors on the host
[10:17:59] I'm on a plane, but if it's a master I suggest a switchover
[10:18:26] mariadb has indeed been running for only 4 minutes on that host
[10:18:27] it's RO and that's causing a flood of errors
[10:18:39] yeah, I will do a switchover
[10:18:50] let me set the section to read only in dbctl
[10:19:16] if you want someone else to do something, shout, I'm here
[10:19:39] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2213 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1106557 (https://phabricator.wikimedia.org/T382743)
[10:19:43] (03PS1) 10Gerrit maintenance bot: wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/1106558 (https://phabricator.wikimedia.org/T382743)
[10:19:54] T382743
[10:19:55] T382743: Switchover s5 master (db2123 -> db2213) - https://phabricator.wikimedia.org/T382743
[10:20:29] marostegui: Shall I go forward with this ^?
[10:20:50] Yes, go for the switch
[10:21:02] You'll need: --master-read-only
[10:21:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Set s5 codfw as read-only for maintenance - T382743', diff saved to https://phabricator.wikimedia.org/P71743 and previous config saved to /var/cache/conftool/dbconfig/20241224-102102-ladsgroup.json
[10:21:07] On the switchover script
[10:21:11] ah good point, thanks
[10:21:19] And remember to set read only off on the NEW master once it is switched
[10:21:23] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s5 T382743
[10:21:47] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s5 T382743
[10:21:58] Amir1: ^ the new master will require read only off manually
[10:22:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Set db2213 with weight 0 T382743', diff saved to https://phabricator.wikimedia.org/P71744 and previous config saved to /var/cache/conftool/dbconfig/20241224-102200-ladsgroup.json
[10:22:07] As if you use the option above it won't do it for you
[10:22:14] (which makes sense)
[10:22:28] ah okay
[10:22:43] old master weight 100
[10:23:44] moving replicas
[10:23:54] I'm so happy now this takes 2 minutes
[10:24:03] it used to take half an hour
[10:24:20] progress :)
[10:24:58] https://orchestrator.wikimedia.org/web/cluster/alias/s5
[10:25:05] you can check the progress in the above
[10:26:00] Emperor: mind creating a task for the old master crash so we can track the HW issues etc?
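The read-only handling being discussed above comes down to a single MariaDB global variable: the switchover script's --master-read-only option leaves the new master read-only, so it has to be cleared by hand afterwards. A minimal sketch of the statements involved (host roles taken from the log; this is an illustration, not the actual runbook):

```sql
-- On the NEW primary (db2213), after a switchover run with --master-read-only:
-- the section stays unwritable until read_only is cleared manually.
SET GLOBAL read_only = 0;
SHOW GLOBAL VARIABLES LIKE 'read_only';  -- should now report OFF

-- On the OLD (crashed) primary, db2123, writes stay blocked:
SET GLOBAL read_only = 1;
```

Running `SET GLOBAL read_only = 1` instead of `0` on the new primary is exactly the slip acknowledged later in the log ("I set the read only to 1").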
[10:26:08] will do
[10:26:14] Thanks <3
[10:26:21] I need to keep flying
[10:26:26] Ping me if needed
[10:27:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:28:44] it really had to break on Christmas day, you know
[10:30:09] okay, that is done now
[10:30:28] (03CR) 10Ladsgroup: [C:03+2] mariadb: Promote db2213 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1106557 (https://phabricator.wikimedia.org/T382743) (owner: 10Gerrit maintenance bot)
[10:31:29] Made T382744 and included the mariadb barf-o-gram
[10:31:29] T382744: mysql crash on db2123 - https://phabricator.wikimedia.org/T382744
[10:32:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:33:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Promote db2213 to s5 primary and set section read-write T382743', diff saved to https://phabricator.wikimedia.org/P71745 and previous config saved to /var/cache/conftool/dbconfig/20241224-103304-ladsgroup.json
[10:33:09] T382743: Switchover s5 master (db2123 -> db2213) - https://phabricator.wikimedia.org/T382743
[10:33:33] it should be back-ish now
[10:33:42] users should be able to edit
[10:33:57] can someone try it in dewiki? on a user subpage
[10:34:34] RECOVERY - MariaDB read only s5 #page on db2123 is OK: Version 10.6.17-MariaDB-log, Uptime 1245s, read_only: True, event_scheduler: True, 24.72 QPS, connection latency: 0.030986s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[10:35:57] trying
[10:36:34] Amir1: still getting a r/o error
[10:36:44] I think I need to manually drop the row in pt-heartbeat
[10:37:48] The system administrator who locked it offered this explanation: The primary database server is running in read-only mode.
[10:38:07] sigh, somehow when I ran it it didn't work
[10:38:41] ah, I am the idiot. I set the read only to 1
[10:38:49] taavi: can you try now?
[10:39:10] https://de.wikipedia.org/w/index.php?title=Benutzer:Taavi/sandbox&oldid=251541310
[10:39:11] works
[10:39:20] awesome
[10:39:22] cool
[10:39:26] now the cleanup of the replicas
[10:39:29] thanks Amir1!
[10:41:33] > Slave_IO_State: Waiting to reconnect after a failed master event read
[10:42:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:46:07] heartbeat is cleaned, so the lag should be okay now
[10:46:35] but replicas constantly decide to wait because they couldn't connect before. I need to flush some settings somewhere
[10:48:03] it's now fixed. Now only the old master is lagging, because it's RO
[10:48:50] (03CR) 10Ladsgroup: [C:03+2] wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/1106558 (https://phabricator.wikimedia.org/T382743) (owner: 10Gerrit maintenance bot)
[10:52:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool db2123 T382743', diff saved to https://phabricator.wikimedia.org/P71746 and previous config saved to /var/cache/conftool/dbconfig/20241224-105203-ladsgroup.json
[10:52:08] T382743: Switchover s5 master (db2123 -> db2213) - https://phabricator.wikimedia.org/T382743
[10:53:31] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 10 days, 0:00:00 on db2123.codfw.wmnet with reason: Broken T382743 T382743
[10:53:34] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on db2123.codfw.wmnet with reason: Broken T382743 T382743
[10:55:56] I restarted replication on the old master
[10:56:05] it's catching up, it seems
[10:57:35] I have to do some family stuff. Things are fine; send me an SMS or call if things are not okay
[10:58:06] don't know why I don't get pages on splunk...
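Two cleanup steps from the conversation above (dropping the stale pt-heartbeat row and getting stuck replicas to reconnect) can be sketched as plain MariaDB statements. This assumes pt-heartbeat's default heartbeat.heartbeat table; the server_id value is a placeholder, and the I/O-thread restart is one plausible way to force the reconnect, not necessarily what was done here:

```sql
-- Replicas compute lag from the freshest heartbeat row, so a stale row
-- written by the crashed old primary makes lag look permanent.
SELECT server_id, ts FROM heartbeat.heartbeat;
DELETE FROM heartbeat.heartbeat WHERE server_id = 12345;  -- old primary's server_id (placeholder)

-- A replica stuck in "Waiting to reconnect after a failed master event read"
-- can be nudged by restarting the replication I/O thread:
STOP SLAVE IO_THREAD;
START SLAVE IO_THREAD;
SHOW SLAVE STATUS\G  -- expect Slave_IO_Running: Yes
```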
[10:58:25] fabfur: for my case: I'm logged out
[10:58:38] * Emperor gets them by SMS
[11:06:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[12:57:12] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:59:17] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:01:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[13:28:25] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:24:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422749 (10phaultfinder)
[14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:38:05] !log T382741 Ran mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=bnwiki --logwiki=metawiki 'Esteban16' 'Renamed user f26394dcb19bd7bdad78f0d752896653'
[14:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:10] T382741: Unblock stuck global rename of Renamed_user_f26394dcb19bd7bdad78f0d752896653 - https://phabricator.wikimedia.org/T382741
[14:50:28] 10SRE-swift-storage: Cannot move File:فندق قصبة بوزنيقة.jpg to File:Kasbah Hotel in Bouznika.jpg on Commons - https://phabricator.wikimedia.org/T382750 (10mdaniels5757) 03NEW
[15:06:18] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2143 - https://phabricator.wikimedia.org/T382751 (10ops-monitoring-bot) 03NEW
[15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:14:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422777 (10phaultfinder)
[15:33:22] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db2143 - https://phabricator.wikimedia.org/T382751#10422793 (10Ladsgroup) It's already depooled. It seems this exists too {T354593}
[15:59:47] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422798 (10phaultfinder)
[16:44:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422827 (10phaultfinder)
[17:14:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422854 (10phaultfinder)
[17:28:25] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:49:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422859 (10phaultfinder)
[18:34:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422861 (10phaultfinder)
[18:54:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422873 (10phaultfinder)
[19:24:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422875 (10phaultfinder)
[20:09:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422892 (10phaultfinder)
[21:28:25] FIRING: SystemdUnitFailed: netbox_ganeti_ulsfo_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:24:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10422931 (10phaultfinder)