[00:22:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:10] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:34:49] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:46:59] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.268 second response time https://wikitech.wikimedia.org/wiki/Swift
[01:48:43] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.140 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:06:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:27] PROBLEM - Check systemd state on prometheus4001 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:21:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:24:05] RECOVERY - Check systemd state on prometheus4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:28:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[02:29:18] here, looking
[02:29:30] o/
[02:29:59] eqsin availability dropped, I'm going to depool first and ask questions later
[02:31:15] (PS1) RLazarus: depool eqsin [dns] - https://gerrit.wikimedia.org/r/896771
[02:31:48] urandom: can you poke around for any obvious cause while I get that done?
[02:32:21] oh whoops eqsin is back, maybe just a blip
[02:32:22] hrm
[02:32:44] yeah, was going to say
[02:33:23] holding off on the depool for now, keeping it handy though
[02:33:28] I'd sure like to know what happened
[02:33:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[02:34:35] oh, I think maybe I see
[02:35:12] (switched channels)
[02:48:37] (Abandoned) RLazarus: depool eqsin [dns] - https://gerrit.wikimedia.org/r/896771 (owner: RLazarus)
[03:07:26] SRE, SRE Observability: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (RLazarus)
[03:30:25] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.411 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:32:09] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.142 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:45:37] (PS1) Legoktm: Add to verify Mastodon account on mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/896837
[04:46:26] (PS2) Legoktm: Add to verify Mastodon account on mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/896837
[04:47:28] (PS3) Legoktm: Add to verify Mastodon account on mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/896837
[05:22:45] (CR) Krinkle: [C: +1] Add to verify Mastodon account on mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/896837 (owner: Legoktm)
[06:07:10] (PS1) TsepoThoabala: Deploy action blocks on itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/896900 (https://phabricator.wikimedia.org/T330533)
[06:44:16] (PS1) TsepoThoabala: Undeploy SimilarEditors from Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/896936 (https://phabricator.wikimedia.org/T331718)
[07:04:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:09:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:48:13] SRE, SRE Observability: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (elukey) On March 9th ~ 16 UTC there was a severe drop in data ingested by Benthos: https://grafana.wikimedia.org/d/V0TSK7O4z/benthos?orgId=1&from=16...
[07:49:10] !log stop and mask benthos-webrequest-live on centrallog1001 - T331801
[07:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:16] T331801: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801
[07:49:57] !log restart benthos-webrequest-live on centrallog2002 - T331801
[07:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:56] !log restart benthos-webrequest-live on centrallog1002 - T331801
[07:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:47] PROBLEM - Check systemd state on centrallog1001 is CRITICAL: CRITICAL - degraded: The following units failed: benthos@webrequest_live.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:53:53] SRE, SRE Observability: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (elukey) I see some text data in https://w.wiki/6Rzi, I'll recheck in a bit to see if everything is stable.
[07:54:45] (JobUnavailable) firing: Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230312T0800)
[08:04:57] RECOVERY - Check systemd state on centrallog1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:09:45] (JobUnavailable) resolved: Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:32:47] PROBLEM - Check systemd state on centrallog1001 is CRITICAL: CRITICAL - degraded: The following units failed: benthos@webrequest_live.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:34:45] (JobUnavailable) firing: Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:42:16] SRE, SRE Observability: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (elukey) Something is still off, the traffic volume reported by turnilo for live vs batch webrequest data is still different (live a lot less). Someth...
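For context, the stop-and-mask step !logged at 07:49 above would amount to roughly the following on centrallog1001. This is a sketch only: the unit name is taken from the systemd alert in the log (benthos@webrequest_live.service), but the exact commands that were run are not recorded here.

  # stop the running instance and mask it so nothing (puppet, timers) restarts it
  sudo systemctl stop benthos@webrequest_live.service
  sudo systemctl mask benthos@webrequest_live.service
  # reverse later with:
  # sudo systemctl unmask benthos@webrequest_live.service && sudo systemctl start benthos@webrequest_live.service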
[08:52:14] (PS1) Elukey: Revert "centrallog: Remove centrallog1001 from the kafka-jumbo allow list" [puppet] - https://gerrit.wikimedia.org/r/896043
[08:54:28] (CR) Elukey: [C: +2] Revert "centrallog: Remove centrallog1001 from the kafka-jumbo allow list" [puppet] - https://gerrit.wikimedia.org/r/896043 (owner: Elukey)
[08:58:51] RECOVERY - Check systemd state on centrallog1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:59:45] (JobUnavailable) resolved: Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:25:24] SRE-swift-storage, Commons: Some or all of the undeletion failed: The file "mwstore://local-multiwrite/local-public/d/d7/Elizabeth_Sombart,_February,_2023.jpg" is in an inconsistent state within the internal storage backends - https://phabricator.wikimedia.org/T331800 (Aklapper)
[09:25:45] SRE, SRE Observability: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (elukey) Re-added 1001 back into Kafka Jumbo's firewall allowed host list, and restarted benthos on it. The traffic volume increased a lot, but then w...
[09:29:45] (JobUnavailable) firing: Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:30:58] SRE, SRE Observability: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (elukey) I would try with a consumer group offset reset: ` kafka consumer-groups --describe --group benthos-webrequest-sampled-live --reset-offsets -...
[09:39:58] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:44:58] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:06:04] (PS1) Elukey: profile::benthos: change kafka consumer group name for webrequest [puppet] - https://gerrit.wikimedia.org/r/897063 (https://phabricator.wikimedia.org/T331801)
[10:12:42] SRE, SRE Observability, Patch-For-Review: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (elukey) ` elukey@kafka-jumbo1001:~$ kafka consumer-groups --describe --group benthos-webrequest-sampled-live kafka-consumer-gro...
[10:25:55] SRE-swift-storage, Commons, Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (Yann) I tried again https://commons.wikimedia.org/wiki/File:Gandhi_-_Young_India,_v._4,_1922.pdf (1150 M...
[10:27:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:32:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:47:16] !log reset offsets on kafka jumbo for benthos webrequest live (as indicated in https://phabricator.wikimedia.org/T331801#8685569)
[10:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:51:00] SRE, SRE Observability, Patch-For-Review: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (elukey) Seems better now, from the consumer group's consistency point of view: ` elukey@kafka-jumbo1001:~$ kafka consumer-grou...
[10:52:06] (PS8) Giuseppe Lavagetto: Add check_dns_state to service.Service [software/spicerack] - https://gerrit.wikimedia.org/r/894655
[10:54:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:14:43] SRE, SRE Observability, Patch-For-Review: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (elukey) The traffic handled by benthos is around 1/3 of the original one now (improved but not really ok). I don't see clear in...
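The offset reset !logged at 10:47 follows the approach suggested in the 09:30 task comment; written out in full it would look roughly like the sketch below. This is an assumption, not a copy of what was actually run: the kafka consumer-groups wrapper and the group name come from the log, but the topic names (webrequest_text, webrequest_upload) and the reset target (--to-latest) are guesses. Note that kafka-consumer-groups will only reset offsets while the group has no active members, so the benthos units have to be stopped first.

  # inspect the group's current offsets and lag per partition
  kafka consumer-groups --describe --group benthos-webrequest-sampled-live
  # reset the group's offsets to the latest position; drop --execute to get a dry-run preview
  kafka consumer-groups --group benthos-webrequest-sampled-live \
    --topic webrequest_text --topic webrequest_upload \
    --reset-offsets --to-latest --execute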
[12:32:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:35:43] PROBLEM - Mediawiki CirrusSearch pool counter rejections rate on alert1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Pool_Counter_rejections_%28search_is_currently_too_busy%29 https://grafana.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?viewPanel=4&orgId=1
[12:42:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:52:31] RECOVERY - Mediawiki CirrusSearch pool counter rejections rate on alert1001 is OK: OK: Less than 1.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Pool_Counter_rejections_%28search_is_currently_too_busy%29 https://grafana.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?viewPanel=4&orgId=1
[12:56:47] SRE, Wikimedia-Mailing-lists: Close wikimediameta-l mailing list - https://phabricator.wikimedia.org/T233666 (MarcoAurelio) Resolved→Open Sorry for reopening. Since we migrated to mailman3 the list reactivated itself. Could you please re-close it according to the [[ https://wikitech.wikimedia.org...
[12:57:00] SRE, Wikimedia-Mailing-lists: Close wikimediameta-l mailing list - https://phabricator.wikimedia.org/T233666 (MarcoAurelio) a: colewhite→None resetting assignee just in case
[13:29:45] (JobUnavailable) firing: Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:44:17] (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[13:45:02] hello
[13:45:15] o/
[13:49:17] (NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[13:57:15] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.204 second response time https://wikitech.wikimedia.org/wiki/Swift
[13:59:01] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.156 second response time https://wikitech.wikimedia.org/wiki/Swift
[14:38:39] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:19:05] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[15:20:45] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.164 second response time https://wikitech.wikimedia.org/wiki/Swift
[15:22:53] PROBLEM - Check systemd state on ms-be2040 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:30:09] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.279 second response time https://wikitech.wikimedia.org/wiki/Swift
[15:31:53] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[15:44:59] PROBLEM - Disk space on ms-be2040 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdn1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2040&var-datasource=codfw+prometheus/ops
[16:44:45] (JobUnavailable) firing: (2) Reduced availability for job benthos in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:49:05] (PS1) Zabe: UserRenameHandler: Use core RenameUser classes [extensions/AbuseFilter] (wmf/1.40.0-wmf.26) - https://gerrit.wikimedia.org/r/897187 (https://phabricator.wikimedia.org/T27482)
[16:49:21] (PS1) Zabe: use core Renameuser classes [extensions/LiquidThreads] (wmf/1.40.0-wmf.26) - https://gerrit.wikimedia.org/r/897188 (https://phabricator.wikimedia.org/T27482)
[16:49:45] (JobUnavailable) firing: (2) Reduced availability for job benthos in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:58:17] SRE, SRE Observability, Patch-For-Review: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (elukey) Tried to stop both consumers (benthos systemd units) on centrallog 1002 and 2002, reset again the offsets, start the co...
[16:59:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:00:12] SRE, SRE Observability, Patch-For-Review: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (elukey) The weird thing is that I keep seeing zero consumers: ` elukey@kafka-jumbo1001:~$ kafka consumer-groups --describe --g...
[17:04:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:29:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:34:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:42:33] RECOVERY - Check systemd state on ms-be2040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:44:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:49:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:50:03] RECOVERY - Disk space on ms-be2040 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2040&var-datasource=codfw+prometheus/ops
[17:59:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:04:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:07:44] (PS2) JMeybohm: calico/kubernetes: Replace istio_cni_token with client cert [puppet] - https://gerrit.wikimedia.org/r/896385 (https://phabricator.wikimedia.org/T328291)
[18:17:21] (CR) JMeybohm: rdf-streaming-updater: add a "wcqs" release (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/896362 (owner: DCausse)
[18:32:17] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.375 second response time https://wikitech.wikimedia.org/wiki/Swift
[18:33:59] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.166 second response time https://wikitech.wikimedia.org/wiki/Swift
[18:54:56] (PS3) JMeybohm: calico/kubernetes: Replace istio_cni_token with client cert [puppet] - https://gerrit.wikimedia.org/r/896385 (https://phabricator.wikimedia.org/T328291)
[19:06:10] (PS4) JMeybohm: calico/kubernetes: Replace istio_cni_token with client cert [puppet] - https://gerrit.wikimedia.org/r/896385 (https://phabricator.wikimedia.org/T328291)
[19:51:29] (CR) JMeybohm: [V: +1] Move default kubernetes version to 1.23 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/896134 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[20:05:27] SRE-swift-storage, Commons, Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (Tgr) >>! In T328872#8684815, @Rschen7754 wrote: > It appears to be somewhat random, however it throws an...
[20:16:19] (PS1) JMeybohm: calico/kubernetes: Replace calicoctl token with client cert [puppet] - https://gerrit.wikimedia.org/r/897361 (https://phabricator.wikimedia.org/T328291)
[20:18:20] (CR) CI reject: [V: -1] calico/kubernetes: Replace calicoctl token with client cert [puppet] - https://gerrit.wikimedia.org/r/897361 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[20:19:38] (PS2) JMeybohm: calico/kubernetes: Replace calicoctl token with client cert [puppet] - https://gerrit.wikimedia.org/r/897361 (https://phabricator.wikimedia.org/T328291)
[20:21:32] (CR) JMeybohm: "PCC auto is huge on this one, also there are a bunch of hosts with differences to core resources, no idea why that is:" [puppet] - https://gerrit.wikimedia.org/r/896385 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[20:29:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:34:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:36:42] (PS3) JMeybohm: calico/kubernetes: Replace calico cni and ctl tokens with client certs [puppet] - https://gerrit.wikimedia.org/r/897361 (https://phabricator.wikimedia.org/T328291)
[20:36:44] (PS1) JMeybohm: cfssl/cert: Allow to absent cert resources [puppet] - https://gerrit.wikimedia.org/r/897364 (https://phabricator.wikimedia.org/T328291)
[20:36:46] (PS1) JMeybohm: calico/kubernetes: Replace istio_cni_token with client cert [puppet] - https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T328291)
[20:37:20] (CR) CI reject: [V: -1] cfssl/cert: Allow to absent cert resources [puppet] - https://gerrit.wikimedia.org/r/897364 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[20:39:20] (CR) CI reject: [V: -1] calico/kubernetes: Replace istio_cni_token with client cert [puppet] - https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[20:40:47] (Abandoned) JMeybohm: calico/kubernetes: Replace istio_cni_token with client cert [puppet] - https://gerrit.wikimedia.org/r/896385 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[20:41:14] (PS2) JMeybohm: cfssl/cert: Allow to absent cert resources [puppet] - https://gerrit.wikimedia.org/r/897364 (https://phabricator.wikimedia.org/T328291)
[20:41:15] (PS2) JMeybohm: calico/kubernetes: Replace istio_cni_token with client cert [puppet] - https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T328291)
[20:41:17] (PS4) JMeybohm: calico/kubernetes: Replace calico cni and ctl tokens with client certs [puppet] - https://gerrit.wikimedia.org/r/897361 (https://phabricator.wikimedia.org/T328291)
[20:41:58] (PS3) JMeybohm: calico/kubernetes: Replace istio_cni_token with client cert [puppet] - https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T328291)
[20:42:00] (PS5) JMeybohm: calico/kubernetes: Replace calico cni and ctl tokens with client certs [puppet] - https://gerrit.wikimedia.org/r/897361 (https://phabricator.wikimedia.org/T328291)
[20:43:44] (CR) jenkins-bot: calico/kubernetes: Replace istio_cni_token with client cert [puppet] - https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[20:47:40] (CR) JMeybohm: "PCC: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40079/console" [puppet] - https://gerrit.wikimedia.org/r/897364 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[20:49:45] (JobUnavailable) firing: Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:27:09] (CR) Jforrester: "Let's add a comment that this is temporary with a reference to the task?" [mediawiki-config] - https://gerrit.wikimedia.org/r/896837 (owner: Legoktm)
[21:28:23] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[21:33:57] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[21:34:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:39:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:59:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:04:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:41:40] SRE-swift-storage, Commons, Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (Don-vip) Don't know if it helps but I just got this error with CropTool for my first attempt at creating...
[23:21:43] PROBLEM - Host asw2-b-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[23:23:23] PROBLEM - Host fasw-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[23:23:35] PROBLEM - Host asw2-d-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[23:23:35] PROBLEM - Host asw2-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[23:23:59] PROBLEM - Host asw2-a-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[23:24:11] PROBLEM - Host ps1-f3-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[23:24:15] PROBLEM - Host ps1-e2-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[23:24:15] PROBLEM - Host ps1-f1-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[23:24:43] PROBLEM - Host ps1-f4-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[23:24:53] PROBLEM - Host ps1-e3-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[23:24:53] PROBLEM - Host ps1-e1-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[23:24:53] PROBLEM - Host ps1-f2-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[23:25:01] PROBLEM - Host ps1-e4-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[23:25:29] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:26:13] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:28:15] PROBLEM - Host mr1-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[23:28:59] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[23:29:45] (JobUnavailable) firing: (2) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:34:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:39:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency