[00:01:04] (CR) Wziko: "Ok, I have change my patch to only include values.yaml." [deployment-charts] - https://gerrit.wikimedia.org/r/1099837 (owner: Wziko)
[00:22:46] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:24:28] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:32:28] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[00:33:22] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[00:38:28] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1101128
[00:38:28] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1101128 (owner: TrainBranchBot)
[00:59:00] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1101128 (owner: TrainBranchBot)
[01:08:30] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1101129
[01:08:30] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1101129 (owner: TrainBranchBot)
[01:12:50] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 38156320 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:13:50] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 7265800 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:27:17] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1101129 (owner: TrainBranchBot)
[02:40:42] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:05:42] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:24:28] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:44:04] (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1101139
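The SystemdUnitFailed alerts above (httpbb_kubernetes_mw-web_hourly, load-dcatap-weekly, ifup@eno12399np0) all link to the same check_systemd_state runbook, which essentially means inspecting the failed unit on the host named in the alert. A minimal triage sketch, assuming it is run directly on the affected host (in practice you would get there via SSH or cumin); the default unit name is just the one from the 00:24 alert:

```python
#!/usr/bin/env python3
"""Rough triage for a SystemdUnitFailed alert (a sketch, not WMF tooling).

Assumes it runs on the host named in the alert, e.g. cumin2002.
See https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
"""
import subprocess
import sys


def triage(unit: str) -> None:
    # `systemctl is-failed` prints the unit state and exits non-zero
    # when the unit is not currently failed.
    state = subprocess.run(
        ["systemctl", "is-failed", unit], capture_output=True, text=True
    ).stdout.strip()
    print(f"{unit}: {state or 'unknown'}")

    # The status summary plus the last journal lines usually show why it failed.
    subprocess.run(["systemctl", "status", "--no-pager", unit])
    subprocess.run(["journalctl", "-u", unit, "-n", "20", "--no-pager"])


if __name__ == "__main__":
    triage(sys.argv[1] if len(sys.argv) > 1 else "httpbb_kubernetes_mw-web_hourly.service")
```

Once the underlying problem is gone (the httpbb unit above recovered on its own at 00:22), `systemctl reset-failed <unit>` clears the failed state so the alert can resolve.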
[05:46:16] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:46:14] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:05:42] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:24:28] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:25:24] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.023e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[09:31:45] (CR) Elukey: [C:+1] hadoop: sort local-dirs [puppet] - https://gerrit.wikimedia.org/r/1101093 (https://phabricator.wikimedia.org/T381538) (owner: JHathaway)
[10:45:24] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 9012 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[11:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:05:42] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:17:45] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[11:54:42] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 208, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:54:50] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:24:28] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
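Each FIRING/RESOLVED line above carries an alerts.wikimedia.org link, which is a filtered view of the current alert state. The same information can be pulled programmatically through the standard Alertmanager v2 API; a small sketch, where the endpoint URL is a placeholder rather than the real internal address behind alerts.wikimedia.org:

```python
#!/usr/bin/env python3
"""List currently firing alerts for one alertname via the Alertmanager v2 API.

ALERTMANAGER is a placeholder; substitute the actual Alertmanager endpoint.
"""
import requests

ALERTMANAGER = "http://alertmanager.example.internal:9093"  # placeholder


def firing_alerts(alertname: str) -> list:
    resp = requests.get(
        f"{ALERTMANAGER}/api/v2/alerts",
        params={
            "filter": f'alertname="{alertname}"',
            "active": "true",
            "silenced": "false",
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    # Print instance and (if present) unit name for each firing alert.
    for alert in firing_alerts("SystemdUnitFailed"):
        labels = alert.get("labels", {})
        print(labels.get("instance", "?"), labels.get("name") or labels.get("unit", "?"))
```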
[12:37:45] RESOLVED: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[13:37:08] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:42:08] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:52:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[13:52:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:53:13] * Emperor here
[13:54:09] * sobanski acked the alert
[13:54:53] we don't have a runbook for port utilisation, do we? I'm looking at superset
[13:57:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:57:47] I don't think there's a runbook, but I'll look
[13:59:31] FIRING: Primary inbound port utilisation over 80% #page: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[14:00:10] Seems to be Jio again?
[14:00:40] I think both inbound and outbound eqiad are paging?
[14:00:48] sorry, I'm not very good at this
[14:01:05] the alert itself is firing for both sides of the cr1-eqiad<->asw2-b-eqiad link, one inbound one outbound
[14:02:08] I don't really know how to investigate further
[14:02:41] Not my strong suit either, but I think it's Jio again, so if we can find the URL maybe we can use requestctl.w.o to block it by looking at an existing Jio rule
[14:04:02] where should I be looking to confirm whether or not it's Jio?
[14:04:31] I went to superset, webrequest sampled live, and then clicked "Geo"
[14:04:45] on the access switch side there's been a large jump in traffic to a cache proxy (https://librenms.wikimedia.org/device/device=161/tab=port/port=31460/), which supports the theory of it being caused by user traffic
[14:07:03] slyngs: OK, but if I zoom out to the last 2 hours, it doesn't seem to have changed in that time?
[14:08:35] You're right, they are just constantly generating a lot of requests
[14:09:11] and if I restrict to just eqiad (is that sensible, given it's eqiad that's paging), they don't appear at all
[14:10:37] Ah, I should have applied that filter
[14:12:15] Nothing really sticks out
[14:12:37] the graph taavi linked to shows (I think) a relatively rapid growth since about 13:30, but I can't see anything thus far in superset that corresponds
[14:14:34] https://librenms.wikimedia.org/eventlog at 15:05
[14:15:13] I don't know what it is, but it's red
[14:17:30] this link has seen a rise in usage
[14:17:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[14:17:31] https://grafana.wikimedia.org/goto/Qez6o-4Ng?orgId=1
[14:17:57] though it's running quite hot anyway, we've seen a little more that tipped it over the edge
[14:18:17] level has subsided in the last few mins
[14:18:59] Was I even a little close, in that it shouldn't be red in librenms?
[14:19:31] RESOLVED: Primary inbound port utilisation over 80% #page: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[14:19:56] !incidents
[14:19:56] 5518 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (asw2-b-eqiad.mgmt.eqiad.wmnet)
[14:19:57] 5517 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[14:21:18] so it seems to have resolved itself, but I'm none the wiser as to what the cause was or how I should have been resolving it :( And I found a dead link in the DoS playbook gdoc for extra fun
[14:21:45] what do you mean 'red in librenms'? you mean the alert in librenms?
[14:22:24] it's likely this was a single heavy traffic flow - though I'm not sure what it may have been
[14:23:17] there are 4x10G links in that bundle to asw2-b-eqiad, but the spike was only on one
[14:23:18] I think he's referring to the https://librenms.wikimedia.org/eventlog bit at 14:05 UTC
[14:23:31] and looking deeper, it's all going to cp1107 - so I suspect this is Wikimedia Enterprise again :(
[14:23:33] xe-3/0/5 lane 0 Rx Power etc
[14:23:54] the text is red for that and the adjacent 4 entries, though they aren't alerting AFAICT
[14:24:03] https://grafana.wikimedia.org/goto/c8iKA-4HR
[14:24:44] They say 15:05 here, but I think that's some localization. It's xe-3/0/5 lane 0 Rx Power cr1-eqiad Dbm under threshold: -40 dBm (< -20 dBm)
[14:26:10] topranks: if I select cp1107 in superset most of the traffic is going to AWS?
[14:26:29] yeah that's where WME are running from
[14:27:28] And 8 million AI scrapers :-)
[14:28:05] They are doing HEAD requests to us, so the cp node needs to pull in a lot of data to respond, but the actual responses are small as we don't send them the full page
[14:28:23] hence we won't see the same jump in outbound internet use
[14:29:09] I found them: https://superset.wikimedia.org/superset/dashboard/webrequest-live/?native_filters_key=K5rFnKZoJ1V7xPyu0BN3Xg3J1Gauy1bXnqOe6-VD6SUXTPto6q8hwK6B3ZakKXq6 <- WME/2.0 (https://enterprise.wikimedia.com/; wme_mgmt@wikimedia.org)
[14:30:21] Yeah, I see them on that node, but not obviously generating lots of traffic (nor obviously doing anything different in the last bit where we got paged)
[14:30:24] 11.2k requests in the past 3 hours, 789M in traffic... I think
[14:31:22] and they look to be doing more GET than HEAD from that graph, pace topranks
[14:31:54] contrariwise, again, they do seem to have been ramping up over the last couple of hours, which is odd to happen on a Saturday
[14:31:57] could be, I was basing that on the pattern from earlier in the week
[14:37:45] if I read superset correctly WME has had 429M from cp1107 in the last 2 hours; is that Too Much?
[14:38:10] if so should we be thinking about applying a rate-limit to their UA for a bit?
[14:38:54] right now it seems ok
[14:39:03] I'll drop them a line on slack and ask them to keep a lid on it
[14:39:36] ops-eqiad, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381720 (phaultfinder) NEW
[14:39:40] topranks: OK, cool, thanks. Anything else we should be doing now?
[14:39:52] no I think we can stand down
[14:40:16] topranks: Thanks :-)
[14:40:16] I'll keep an eye on it this afternoon, if it heats up again we might need to call them to reduce the rate
[14:40:20] or limit them on our side
[14:40:42] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:40:48] yeah, presumably we could use requestctl to rate-limit their UA. I think if we get paged again this weekend I'd suggest doing so at that point
[14:41:04] That seems reasonable
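For context on the numbers being eyeballed in superset above (11.2k requests / 789M in 3 hours, 429M in 2 hours), the decision comes down to summing requests and response bytes per user agent for one cache host and judging whether that is excessive. A purely hypothetical sketch of that accounting; the record layout and the dummy sample are invented for illustration and do not correspond to any real Wikimedia table or API, and the throttle itself, if it were needed, would be a requestctl action as discussed in the conversation:

```python
#!/usr/bin/env python3
"""Hypothetical per-user-agent accounting over sampled webrequest records.

The record fields and the dummy sample below are illustrative only; the
actual investigation was done in the superset webrequest-live dashboard,
and any throttle would be applied via requestctl.
"""
from collections import defaultdict


def summarise(records, cache_host):
    """Sum request count and response bytes per user agent for one cache host."""
    stats = defaultdict(lambda: [0, 0])  # user agent -> [requests, bytes]
    for r in records:
        if r["hostname"] != cache_host:
            continue
        stats[r["user_agent"]][0] += 1
        stats[r["user_agent"]][1] += r["response_size"]
    # Largest byte consumers first.
    return sorted(stats.items(), key=lambda kv: kv[1][1], reverse=True)


if __name__ == "__main__":
    # Dummy records standing in for a sampled-webrequest export.
    sample = [
        {"hostname": "cp1107", "user_agent": "example-bot/1.0", "response_size": 41_000},
        {"hostname": "cp1107", "user_agent": "example-browser/2.0", "response_size": 12_000},
        {"hostname": "cp1107", "user_agent": "example-bot/1.0", "response_size": 38_500},
    ]
    for ua, (reqs, nbytes) in summarise(sample, "cp1107")[:10]:
        print(f"{reqs:>6} req  {nbytes / 1e6:>7.2f} MB  {ua}")
```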
[14:41:52] Hopefully I won't speak to any of you again until Monday :)
[14:42:02] fingers crossed :)
[15:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:05:42] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:24:28] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:39:32] (PS1) AntiCompositeNumber: imagemagick: ignore all py3exiv2 exceptions [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1101154 (https://phabricator.wikimedia.org/T381594)
[16:47:31] (CR) CI reject: [V:-1] imagemagick: ignore all py3exiv2 exceptions [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1101154 (https://phabricator.wikimedia.org/T381594) (owner: AntiCompositeNumber)
[17:09:50] (CR) AntiCompositeNumber: "recheck" [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1101154 (https://phabricator.wikimedia.org/T381594) (owner: AntiCompositeNumber)
[17:12:45] (PS2) AntiCompositeNumber: imagemagick: ignore all py3exiv2 exceptions [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1101154 (https://phabricator.wikimedia.org/T381594)
[18:16:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[18:17:13] Oh, not again
[18:18:31] FIRING: Primary inbound port utilisation over 80% #page: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[18:20:51] there is an uptick in views for a particular video, dunno if that's significant
[18:22:40] (I think not)
[18:26:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[18:26:55] OK, I'll leave it for now, then
[18:26:57] !incidents
[18:26:58] 5520 (ACKED) Primary inbound port utilisation over 80% (paged) global noc (asw2-b-eqiad.mgmt.eqiad.wmnet)
[18:26:58] 5519 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[18:26:58] 5518 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (asw2-b-eqiad.mgmt.eqiad.wmnet)
[18:26:58] 5517 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[18:28:05] (sorry, I have to be AFK for a bit now)
[18:28:31] RESOLVED: Primary inbound port utilisation over 80% #page: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[19:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:05:42] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:18:47] (PS1) Pppery: Reinstate "Centralize enwiki's VisualEditor feedback page" [mediawiki-config] - https://gerrit.wikimedia.org/r/1101157
[19:18:56] (CR) CI reject: [V:-1] Reinstate "Centralize enwiki's VisualEditor feedback page" [mediawiki-config] - https://gerrit.wikimedia.org/r/1101157 (owner: Pppery)
[19:19:12] (Abandoned) Pppery: Reinstate "Centralize enwiki's VisualEditor feedback page" [mediawiki-config] - https://gerrit.wikimedia.org/r/1101157 (owner: Pppery)
[19:22:38] (PS1) Pppery: Update VisualEditor config to drop exclusions based on Flow [mediawiki-config] - https://gerrit.wikimedia.org/r/1101158 (https://phabricator.wikimedia.org/T224851)
[19:35:04] PROBLEM - rt.wikimedia.org tls expiry on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[19:35:58] RECOVERY - rt.wikimedia.org tls expiry on moscovium is OK: OK - Certificate rt.discovery.wmnet will expire on Sun 29 Dec 2024 09:26:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[19:36:42] PROBLEM - rt.wikimedia.org requires authentication on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[19:37:32] RECOVERY - rt.wikimedia.org requires authentication on moscovium is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 536 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[20:24:28] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:00:23] (PS1) Fabfur: haproxy:benthos: type must be string [puppet] - https://gerrit.wikimedia.org/r/1101166 (https://phabricator.wikimedia.org/T329332)
[23:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:05:42] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable