[00:01:04] (CR) Wziko: "Ok, I have change my patch to only include values.yaml." [deployment-charts] - https://gerrit.wikimedia.org/r/1099837 (owner: Wziko)
[00:22:46] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:24:28] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:32:28] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[00:33:22] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[00:38:28] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1101128
[00:38:28] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1101128 (owner: TrainBranchBot)
[00:59:00] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1101128 (owner: TrainBranchBot)
[01:08:30] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1101129
[01:08:30] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1101129 (owner: TrainBranchBot)
[01:12:50] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 38156320 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:13:50] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 7265800 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:27:17] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1101129 (owner: TrainBranchBot)
[02:40:42] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:05:42] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:24:28] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:44:04] (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1101139
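The SystemdUnitFailed alerts above (httpbb_kubernetes_mw-web_hourly, load-dcatap-weekly, ifup@eno12399np0) all link to the same check_systemd_state runbook, which essentially means inspecting the failed unit on the host named in the alert. A minimal triage sketch, assuming it is run directly on the affected host (in practice you would get there via SSH or cumin); the default unit name is just the one from the 00:24 alert:

```python
#!/usr/bin/env python3
"""Rough triage for a SystemdUnitFailed alert (a sketch, not WMF tooling).

Assumes it runs on the host named in the alert, e.g. cumin2002.
See https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
"""
import subprocess
import sys


def triage(unit: str) -> None:
    # `systemctl is-failed` prints the unit state and exits non-zero
    # when the unit is not currently failed.
    state = subprocess.run(
        ["systemctl", "is-failed", unit], capture_output=True, text=True
    ).stdout.strip()
    print(f"{unit}: {state or 'unknown'}")

    # The status summary plus the last journal lines usually show why it failed.
    subprocess.run(["systemctl", "status", "--no-pager", unit])
    subprocess.run(["journalctl", "-u", unit, "-n", "20", "--no-pager"])


if __name__ == "__main__":
    triage(sys.argv[1] if len(sys.argv) > 1 else "httpbb_kubernetes_mw-web_hourly.service")
```

Once the underlying problem is gone (the httpbb unit above recovered on its own at 00:22), `systemctl reset-failed <unit>` clears the failed state so the alert can resolve.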
[05:46:16] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:46:14] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:05:42] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:24:28] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:25:24] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.023e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[09:31:45] (CR) Elukey: [C:+1] hadoop: sort local-dirs [puppet] - https://gerrit.wikimedia.org/r/1101093 (https://phabricator.wikimedia.org/T381538) (owner: JHathaway)
[10:45:24] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 9012 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[11:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:05:42] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:17:45] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[11:54:42] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 208, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:54:50] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:24:28] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
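Each FIRING/RESOLVED line above carries an alerts.wikimedia.org link, which is a filtered view of the current alert state. The same information can be pulled programmatically through the standard Alertmanager v2 API; a small sketch, where the endpoint URL is a placeholder rather than the real internal address behind alerts.wikimedia.org:

```python
#!/usr/bin/env python3
"""List currently firing alerts for one alertname via the Alertmanager v2 API.

ALERTMANAGER is a placeholder; substitute the actual Alertmanager endpoint.
"""
import requests

ALERTMANAGER = "http://alertmanager.example.internal:9093"  # placeholder


def firing_alerts(alertname: str) -> list:
    resp = requests.get(
        f"{ALERTMANAGER}/api/v2/alerts",
        params={
            "filter": f'alertname="{alertname}"',
            "active": "true",
            "silenced": "false",
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    # Print instance and (if present) unit name for each firing alert.
    for alert in firing_alerts("SystemdUnitFailed"):
        labels = alert.get("labels", {})
        print(labels.get("instance", "?"), labels.get("name") or labels.get("unit", "?"))
```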
[12:37:45] RESOLVED: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability
[13:37:08] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:42:08] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:52:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[13:52:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:53:13] * Emperor here
[13:54:09] * sobanski acked the alert
[13:54:53] we don't have a runbook for port utilisation, do we? I'm looking at superset
[13:57:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:57:47] I don't think there's a runbook, but I'll look
[13:59:31] FIRING: Primary inbound port utilisation over 80% #page: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[14:00:10] Seems to be Jio again?
[14:00:40] I think both inbound and outbound eqiad are paging?
[14:00:48] sorry, I'm not very good at this
[14:01:05] the alert itself is firing for both sides of the cr1-eqiad<->asw2-b-eqiad link, one inbound one outbound
[14:02:08] I don't really know how to investigate further
[14:02:41] Not my strong suit either, but I think it's Jio again, so if we can find the URL maybe we can use requestctl.w.o to block it by looking at an existing Jio rule
[14:04:02] where should I be looking to confirm whether or not it's Jio?
[14:04:31] I went to superset, webrequest sampled live, and then clicked "Geo"
[14:04:45] on the access switch side there's been a large jump in traffic to a cache proxy (https://librenms.wikimedia.org/device/device=161/tab=port/port=31460/), which supports the theory of it being caused by user traffic
[14:07:03] slyngs: OK, but if I zoom out to the last 2 hours, it doesn't seem to have changed in that time?
[14:08:35] You're right, they are just constantly generating a lot of requests
[14:09:11] and if I restrict to just eqiad (is that sensible, given it's eqiad that's paging), they don't appear at all
[14:10:37] Ah, I should have applied that filter
[14:12:15] Nothing really sticks out
[14:12:37] the graph taavi linked to shows (I think) a relatively rapid growth since about 13:30, but I can't see anything thus far in superset that corresponds
[14:14:34] https://librenms.wikimedia.org/eventlog at 15:05
[14:15:13] I don't know what it is, but it's red
[14:17:30] this link has seen a rise in usage
[14:17:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[14:17:31] https://grafana.wikimedia.org/goto/Qez6o-4Ng?orgId=1
[14:17:57] though it's running quite hot anyway, we've seen a little more that tipped it over the edge
[14:18:17] level has subsided in the last few mins
[14:18:59] Was I even a little close, in that it shouldn't be red in librenms?
[14:19:31] RESOLVED: Primary inbound port utilisation over 80% #page: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[14:19:56] !incidents
[14:19:56] 5518 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (asw2-b-eqiad.mgmt.eqiad.wmnet)
[14:19:57] 5517 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[14:21:18] so it seems to have resolved itself, but I'm none the wiser as to what the cause was or how I should have been resolving it :( And I found a dead link in the DoS playbook gdoc for extra fun
[14:21:45] what do you mean 'red in librenms'? you mean the alert in librenms?
[14:22:24] it's likely this was a single heavy traffic flow - though I'm not sure what it may have been
[14:23:17] there are 4x10G links in that bundle to asw2-b-eqiad, but the spike was only on one
[14:23:18] I think he's referring to the https://librenms.wikimedia.org/eventlog bit at 14:05 UTC
[14:23:31] and looking deeper, it's all going to cp1107 - so I suspect this is Wikimedia Enterprise again :(
[14:23:33] xe-3/0/5 lane 0 Rx Power etc
[14:23:54] the text is red for that and the adjacent 4 entries, though they aren't alerting AFAICT
[14:24:03] https://grafana.wikimedia.org/goto/c8iKA-4HR
[14:24:44] They say 15:05 here, but I think that's some localization. It's xe-3/0/5 lane 0 Rx Power cr1-eqiad Dbm under threshold: -40 dBm (< -20 dBm)
[14:26:10] topranks: if I select cp1107 in superset most of the traffic is going to AWS?
[14:26:29] yeah that's where WME are running from
[14:27:28] And 8 million AI scrapers :-)
[14:28:05] They are doing HEAD requests to us, so the cp node needs to pull in a lot of data to respond, but the actual responses are small as we don't send them the full page
[14:28:23] hence we won't see the same jump in outbound internet use
[14:29:09] I found them: https://superset.wikimedia.org/superset/dashboard/webrequest-live/?native_filters_key=K5rFnKZoJ1V7xPyu0BN3Xg3J1Gauy1bXnqOe6-VD6SUXTPto6q8hwK6B3ZakKXq6 <- WME/2.0 (https://enterprise.wikimedia.com/; wme_mgmt@wikimedia.org)
[14:30:21] Yeah, I see them on that node, but not obviously generating lots of traffic (nor obviously doing anything different in the last bit where we got paged)
[14:30:24] 11.2k requests in the past 3 hours, 789M in traffic... I think
[14:31:22] and they look to be doing more GET than HEAD from that graph, pace topranks
[14:31:54] contrariwise, again, they do seem to have been ramping up over the last couple of hours, which is odd to happen on a Saturday
[14:31:57] could be, I was basing that on the pattern from earlier in the week
[14:37:45] if I read superset correctly WME has had 429M from cp1107 in the last 2 hours; is that Too Much?
[14:38:10] if so should we be thinking about applying a rate-limit to their UA for a bit?
[14:38:54] right now it seems ok
[14:39:03] I'll drop them a line on slack and ask them to keep a lid on it
[14:39:36] ops-eqiad, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T381720 (phaultfinder) NEW
[14:39:40] topranks: OK, cool, thanks. Anything else we should be doing now?
[14:39:52] no I think we can stand down
[14:40:16] topranks: Thanks :-)
[14:40:16] I'll keep an eye on it this afternoon, if it heats up again we might need to call them to reduce the rate
[14:40:20] or limit them on our side
[14:40:42] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:40:48] yeah, presumably we could use requestctl to rate-limit their UA. I think if we get paged again this weekend I'd suggest doing so at that point
[14:41:04] That seems reasonable
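For context on the numbers being eyeballed in superset above (11.2k requests / 789M in 3 hours, 429M in 2 hours), the decision comes down to summing requests and response bytes per user agent for one cache host and judging whether that is excessive. A purely hypothetical sketch of that accounting; the record layout and the dummy sample are invented for illustration and do not correspond to any real Wikimedia table or API, and the throttle itself, if it were needed, would be a requestctl action as discussed in the conversation:

```python
#!/usr/bin/env python3
"""Hypothetical per-user-agent accounting over sampled webrequest records.

The record fields and the dummy sample below are illustrative only; the
actual investigation was done in the superset webrequest-live dashboard,
and any throttle would be applied via requestctl.
"""
from collections import defaultdict


def summarise(records, cache_host):
    """Sum request count and response bytes per user agent for one cache host."""
    stats = defaultdict(lambda: [0, 0])  # user agent -> [requests, bytes]
    for r in records:
        if r["hostname"] != cache_host:
            continue
        stats[r["user_agent"]][0] += 1
        stats[r["user_agent"]][1] += r["response_size"]
    # Largest byte consumers first.
    return sorted(stats.items(), key=lambda kv: kv[1][1], reverse=True)


if __name__ == "__main__":
    # Dummy records standing in for a sampled-webrequest export.
    sample = [
        {"hostname": "cp1107", "user_agent": "example-bot/1.0", "response_size": 41_000},
        {"hostname": "cp1107", "user_agent": "example-browser/2.0", "response_size": 12_000},
        {"hostname": "cp1107", "user_agent": "example-bot/1.0", "response_size": 38_500},
    ]
    for ua, (reqs, nbytes) in summarise(sample, "cp1107")[:10]:
        print(f"{reqs:>6} req  {nbytes / 1e6:>7.2f} MB  {ua}")
```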
[14:41:52] Hopefully I won't speak to any of you again until Monday :)
[14:42:02] fingers crossed :)
[15:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:05:42] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:24:28] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:39:32] (PS1) AntiCompositeNumber: imagemagick: ignore all py3exiv2 exceptions [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1101154 (https://phabricator.wikimedia.org/T381594)
[16:47:31] (CR) CI reject: [V:-1] imagemagick: ignore all py3exiv2 exceptions [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1101154 (https://phabricator.wikimedia.org/T381594) (owner: AntiCompositeNumber)
[17:09:50] (CR) AntiCompositeNumber: "recheck" [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1101154 (https://phabricator.wikimedia.org/T381594) (owner: AntiCompositeNumber)
[17:12:45] (PS2) AntiCompositeNumber: imagemagick: ignore all py3exiv2 exceptions [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1101154 (https://phabricator.wikimedia.org/T381594)
[18:16:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[18:17:13] Oh, not again
[18:18:31] FIRING: Primary inbound port utilisation over 80% #page: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[18:20:51] there is an uptick in views for a particular video, dunno if that's significant
[18:22:40] (I think not)
[18:26:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[18:26:55] OK, I'll leave it for now, then
[18:26:57] !incidents
[18:26:58] 5520 (ACKED) Primary inbound port utilisation over 80% (paged) global noc (asw2-b-eqiad.mgmt.eqiad.wmnet)
[18:26:58] 5519 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[18:26:58] 5518 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (asw2-b-eqiad.mgmt.eqiad.wmnet)
[18:26:58] 5517 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org)
[18:28:05] (sorry, I have to be AFK for a bit now)
[18:28:31] RESOLVED: Primary inbound port utilisation over 80% #page: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[19:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:05:42] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:18:47] (PS1) Pppery: Reinstate "Centralize enwiki's VisualEditor feedback page" [mediawiki-config] - https://gerrit.wikimedia.org/r/1101157
[19:18:56] (CR) CI reject: [V:-1] Reinstate "Centralize enwiki's VisualEditor feedback page" [mediawiki-config] - https://gerrit.wikimedia.org/r/1101157 (owner: Pppery)
[19:19:12] (Abandoned) Pppery: Reinstate "Centralize enwiki's VisualEditor feedback page" [mediawiki-config] - https://gerrit.wikimedia.org/r/1101157 (owner: Pppery)
[19:22:38] (PS1) Pppery: Update VisualEditor config to drop exclusions based on Flow [mediawiki-config] - https://gerrit.wikimedia.org/r/1101158 (https://phabricator.wikimedia.org/T224851)
[19:35:04] PROBLEM - rt.wikimedia.org tls expiry on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[19:35:58] RECOVERY - rt.wikimedia.org tls expiry on moscovium is OK: OK - Certificate rt.discovery.wmnet will expire on Sun 29 Dec 2024 09:26:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[19:36:42] PROBLEM - rt.wikimedia.org requires authentication on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[19:37:32] RECOVERY - rt.wikimedia.org requires authentication on moscovium is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 536 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[20:24:28] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:00:23] (PS1) Fabfur: haproxy:benthos: type must be string [puppet] - https://gerrit.wikimedia.org/r/1101166 (https://phabricator.wikimedia.org/T329332)
[23:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:05:42] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable