[00:22:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:10] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:34:49] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:46:59] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.268 second response time https://wikitech.wikimedia.org/wiki/Swift
[01:48:43] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.140 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:06:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:27] PROBLEM - Check systemd state on prometheus4001 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:21:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:24:05] RECOVERY - Check systemd state on prometheus4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:28:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[02:29:18] here, looking
[02:29:30] o/
[02:29:59] eqsin availability dropped, I'm going to depool first and ask questions later
[02:31:15] (PS1) RLazarus: depool eqsin [dns] - https://gerrit.wikimedia.org/r/896771
[02:31:48] urandom: can you poke around for any obvious cause while I get that done?
[02:32:21] oh whoops eqsin is back, maybe just a blip
[02:32:22] hrm
[02:32:44] yeah, was going to say
[02:33:23] holding off on the depool for now, keeping it handy though
[02:33:28] I'd sure like to know what happened
[02:33:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[02:34:35] oh, I think maybe I see
[02:35:12] (switched channels)
[02:48:37] (Abandoned) RLazarus: depool eqsin [dns] - https://gerrit.wikimedia.org/r/896771 (owner: RLazarus)
[03:07:26] SRE, SRE Observability: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (RLazarus)
[03:30:25] PROBLEM - Swift https frontend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.411 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:32:09] RECOVERY - Swift https frontend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.142 second response time https://wikitech.wikimedia.org/wiki/Swift
[04:45:37] (PS1) Legoktm: Add to verify Mastodon account on mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/896837
[04:46:26] (PS2) Legoktm: Add to verify Mastodon account on mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/896837
[04:47:28] (PS3) Legoktm: Add to verify Mastodon account on mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/896837
[05:22:45] (CR) Krinkle: [C: +1] Add to verify Mastodon account on mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/896837 (owner: Legoktm)
[06:07:10] (PS1) TsepoThoabala: Deploy action blocks on itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/896900 (https://phabricator.wikimedia.org/T330533)
[06:44:16] (PS1) TsepoThoabala: Undeploy SimilarEditors from Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/896936 (https://phabricator.wikimedia.org/T331718)
[07:04:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:09:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:48:13] SRE, SRE Observability: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (elukey) On March 9th ~ 16 UTC there was a severe drop in data ingested by Benthos: https://grafana.wikimedia.org/d/V0TSK7O4z/benthos?orgId=1&from=16...
[07:49:10] !log stop and mask benthos-webrequest-live on centrallog1001 - T331801
[07:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:16] T331801: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801
[07:49:57] !log restart benthos-webrequest-live on centrallog2002 - T331801
[07:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:56] !log restart benthos-webrequest-live on centrallog1002 - T331801
[07:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:47] PROBLEM - Check systemd state on centrallog1001 is CRITICAL: CRITICAL - degraded: The following units failed: benthos@webrequest_live.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:53:53] SRE, SRE Observability: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (elukey) I see some text data in https://w.wiki/6Rzi, I'll recheck in a bit to see if everything is stable.
[07:54:45] (JobUnavailable) firing: Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230312T0800)
[08:04:57] RECOVERY - Check systemd state on centrallog1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:09:45] (JobUnavailable) resolved: Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:32:47] PROBLEM - Check systemd state on centrallog1001 is CRITICAL: CRITICAL - degraded: The following units failed: benthos@webrequest_live.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:34:45] (JobUnavailable) firing: Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:42:16] SRE, SRE Observability: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (elukey) Something is still off, the traffic volume reported by turnilo for live vs batch webrequest data is still different (live a lot less). Someth...
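For context, the stop-and-mask step !logged at 07:49 above would amount to roughly the following on centrallog1001. This is a sketch only: the unit name is taken from the systemd alert in the log (benthos@webrequest_live.service), but the exact commands that were run are not recorded here.

  # stop the running instance and mask it so nothing (puppet, timers) restarts it
  sudo systemctl stop benthos@webrequest_live.service
  sudo systemctl mask benthos@webrequest_live.service
  # reverse later with:
  # sudo systemctl unmask benthos@webrequest_live.service && sudo systemctl start benthos@webrequest_live.service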
[08:52:14] (PS1) Elukey: Revert "centrallog: Remove centrallog1001 from the kafka-jumbo allow list" [puppet] - https://gerrit.wikimedia.org/r/896043
[08:54:28] (CR) Elukey: [C: +2] Revert "centrallog: Remove centrallog1001 from the kafka-jumbo allow list" [puppet] - https://gerrit.wikimedia.org/r/896043 (owner: Elukey)
[08:58:51] RECOVERY - Check systemd state on centrallog1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:59:45] (JobUnavailable) resolved: Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:25:24] SRE-swift-storage, Commons: Some or all of the undeletion failed: The file "mwstore://local-multiwrite/local-public/d/d7/Elizabeth_Sombart,_February,_2023.jpg" is in an inconsistent state within the internal storage backends - https://phabricator.wikimedia.org/T331800 (Aklapper)
[09:25:45] SRE, SRE Observability: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (elukey) Re-added 1001 back into Kafka Jumbo's firewall allowed host list, and restarted benthos on it. The traffic volume increased a lot, but then w...
[09:29:45] (JobUnavailable) firing: Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:30:58] SRE, SRE Observability: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (elukey) I would try with a consumer group offset reset: ` kafka consumer-groups --describe --group benthos-webrequest-sampled-live --reset-offsets -...
[09:39:58] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:44:58] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:06:04] (PS1) Elukey: profile::benthos: change kafka consumer group name for webrequest [puppet] - https://gerrit.wikimedia.org/r/897063 (https://phabricator.wikimedia.org/T331801)
[10:12:42] SRE, SRE Observability, Patch-For-Review: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (elukey) ` elukey@kafka-jumbo1001:~$ kafka consumer-groups --describe --group benthos-webrequest-sampled-live kafka-consumer-gro...
[10:25:55] SRE-swift-storage, Commons, Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (Yann) I tried again https://commons.wikimedia.org/wiki/File:Gandhi_-_Young_India,_v._4,_1922.pdf (1150 M...
[10:27:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:32:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:47:16] !log reset offsets on kafka jumbo for benthos webrequest live (as indicated in https://phabricator.wikimedia.org/T331801#8685569)
[10:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:51:00] SRE, SRE Observability, Patch-For-Review: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (elukey) Seems better now, from the consumer group's consistency point of view: ` elukey@kafka-jumbo1001:~$ kafka consumer-grou...
[10:52:06] (PS8) Giuseppe Lavagetto: Add check_dns_state to service.Service [software/spicerack] - https://gerrit.wikimedia.org/r/894655
[10:54:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:14:43] SRE, SRE Observability, Patch-For-Review: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (elukey) The traffic handled by benthos is around 1/3 of the original one now (improved but not really ok). I don't see clear in...
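The offset reset !logged at 10:47 follows the approach suggested in the 09:30 task comment; written out in full it would look roughly like the sketch below. This is an assumption, not a copy of what was actually run: the kafka consumer-groups wrapper and the group name come from the log, but the topic names (webrequest_text, webrequest_upload) and the reset target (--to-latest) are guesses. Note that kafka-consumer-groups will only reset offsets while the group has no active members, so the benthos units have to be stopped first.

  # inspect the group's current offsets and lag per partition
  kafka consumer-groups --describe --group benthos-webrequest-sampled-live
  # reset the group's offsets to the latest position; drop --execute to get a dry-run preview
  kafka consumer-groups --group benthos-webrequest-sampled-live \
    --topic webrequest_text --topic webrequest_upload \
    --reset-offsets --to-latest --execute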
[12:32:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:35:43] PROBLEM - Mediawiki CirrusSearch pool counter rejections rate on alert1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Pool_Counter_rejections_%28search_is_currently_too_busy%29 https://grafana.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?viewPanel=4&orgId=1
[12:42:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:52:31] RECOVERY - Mediawiki CirrusSearch pool counter rejections rate on alert1001 is OK: OK: Less than 1.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Pool_Counter_rejections_%28search_is_currently_too_busy%29 https://grafana.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?viewPanel=4&orgId=1
[12:56:47] SRE, Wikimedia-Mailing-lists: Close wikimediameta-l mailing list - https://phabricator.wikimedia.org/T233666 (MarcoAurelio) Resolved→Open Sorry for reopening. Since we migrated to mailman3 the list reactivated itself. Could you please re-close it according to the [[ https://wikitech.wikimedia.org...
[12:57:00] SRE, Wikimedia-Mailing-lists: Close wikimediameta-l mailing list - https://phabricator.wikimedia.org/T233666 (MarcoAurelio) a: colewhite→None resetting assignee just in case
[13:29:45] (JobUnavailable) firing: Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:44:17] (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[13:45:02] hello
[13:45:15] o/
[13:49:17] (NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[13:57:15] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.204 second response time https://wikitech.wikimedia.org/wiki/Swift
[13:59:01] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.156 second response time https://wikitech.wikimedia.org/wiki/Swift
[14:38:39] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:19:05] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[15:20:45] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.164 second response time https://wikitech.wikimedia.org/wiki/Swift
[15:22:53] PROBLEM - Check systemd state on ms-be2040 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:30:09] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.279 second response time https://wikitech.wikimedia.org/wiki/Swift
[15:31:53] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Swift
[15:44:59] PROBLEM - Disk space on ms-be2040 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdn1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2040&var-datasource=codfw+prometheus/ops
[16:44:45] (JobUnavailable) firing: (2) Reduced availability for job benthos in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:49:05] (PS1) Zabe: UserRenameHandler: Use core RenameUser classes [extensions/AbuseFilter] (wmf/1.40.0-wmf.26) - https://gerrit.wikimedia.org/r/897187 (https://phabricator.wikimedia.org/T27482)
[16:49:21] (PS1) Zabe: use core Renameuser classes [extensions/LiquidThreads] (wmf/1.40.0-wmf.26) - https://gerrit.wikimedia.org/r/897188 (https://phabricator.wikimedia.org/T27482)
[16:49:45] (JobUnavailable) firing: (2) Reduced availability for job benthos in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:58:17] SRE, SRE Observability, Patch-For-Review: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (elukey) Tried to stop both consumers (benthos systemd units) on centrallog 1002 and 2002, reset again the offsets, start the co...
[16:59:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:00:12] SRE, SRE Observability, Patch-For-Review: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (elukey) The weird thing is that I keep seeing zero consumers: ` elukey@kafka-jumbo1001:~$ kafka consumer-groups --describe --g...
[17:04:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:29:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:34:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:42:33] RECOVERY - Check systemd state on ms-be2040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:44:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:49:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[17:50:03] RECOVERY - Disk space on ms-be2040 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2040&var-datasource=codfw+prometheus/ops
[17:59:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:04:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:07:44] (PS2) JMeybohm: calico/kubernetes: Replace istio_cni_token with client cert [puppet] - https://gerrit.wikimedia.org/r/896385 (https://phabricator.wikimedia.org/T328291)
[18:17:21] (CR) JMeybohm: rdf-streaming-updater: add a "wcqs" release (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/896362 (owner: DCausse)
[18:32:17] PROBLEM - Swift https backend on ms-fe2012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.375 second response time https://wikitech.wikimedia.org/wiki/Swift
[18:33:59] RECOVERY - Swift https backend on ms-fe2012 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.166 second response time https://wikitech.wikimedia.org/wiki/Swift
[18:54:56] (PS3) JMeybohm: calico/kubernetes: Replace istio_cni_token with client cert [puppet] - https://gerrit.wikimedia.org/r/896385 (https://phabricator.wikimedia.org/T328291)
[19:06:10] (PS4) JMeybohm: calico/kubernetes: Replace istio_cni_token with client cert [puppet] - https://gerrit.wikimedia.org/r/896385 (https://phabricator.wikimedia.org/T328291)
[19:51:29] (CR) JMeybohm: [V: +1] Move default kubernetes version to 1.23 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/896134 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[20:05:27] SRE-swift-storage, Commons, Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (Tgr) >>! In T328872#8684815, @Rschen7754 wrote: > It appears to be somewhat random, however it throws an...
[20:16:19] (PS1) JMeybohm: calico/kubernetes: Replace calicoctl token with client cert [puppet] - https://gerrit.wikimedia.org/r/897361 (https://phabricator.wikimedia.org/T328291)
[20:18:20] (CR) CI reject: [V: -1] calico/kubernetes: Replace calicoctl token with client cert [puppet] - https://gerrit.wikimedia.org/r/897361 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[20:19:38] (PS2) JMeybohm: calico/kubernetes: Replace calicoctl token with client cert [puppet] - https://gerrit.wikimedia.org/r/897361 (https://phabricator.wikimedia.org/T328291)
[20:21:32] (CR) JMeybohm: "PCC auto is huge on this one, also there are a bunch of hosts with differences to core resources, no idea why that is:" [puppet] - https://gerrit.wikimedia.org/r/896385 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[20:29:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:34:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:36:42] (PS3) JMeybohm: calico/kubernetes: Replace calico cni and ctl tokens with client certs [puppet] - https://gerrit.wikimedia.org/r/897361 (https://phabricator.wikimedia.org/T328291)
[20:36:44] (PS1) JMeybohm: cfssl/cert: Allow to absent cert resources [puppet] - https://gerrit.wikimedia.org/r/897364 (https://phabricator.wikimedia.org/T328291)
[20:36:46] (PS1) JMeybohm: calico/kubernetes: Replace istio_cni_token with client cert [puppet] - https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T328291)
[20:37:20] (CR) CI reject: [V: -1] cfssl/cert: Allow to absent cert resources [puppet] - https://gerrit.wikimedia.org/r/897364 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[20:39:20] (CR) CI reject: [V: -1] calico/kubernetes: Replace istio_cni_token with client cert [puppet] - https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[20:40:47] (Abandoned) JMeybohm: calico/kubernetes: Replace istio_cni_token with client cert [puppet] - https://gerrit.wikimedia.org/r/896385 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[20:41:14] (PS2) JMeybohm: cfssl/cert: Allow to absent cert resources [puppet] - https://gerrit.wikimedia.org/r/897364 (https://phabricator.wikimedia.org/T328291)
[20:41:15] (PS2) JMeybohm: calico/kubernetes: Replace istio_cni_token with client cert [puppet] - https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T328291)
[20:41:17] (PS4) JMeybohm: calico/kubernetes: Replace calico cni and ctl tokens with client certs [puppet] - https://gerrit.wikimedia.org/r/897361 (https://phabricator.wikimedia.org/T328291)
[20:41:58] (PS3) JMeybohm: calico/kubernetes: Replace istio_cni_token with client cert [puppet] - https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T328291)
[20:42:00] (PS5) JMeybohm: calico/kubernetes: Replace calico cni and ctl tokens with client certs [puppet] - https://gerrit.wikimedia.org/r/897361 (https://phabricator.wikimedia.org/T328291)
[20:43:44] (CR) jenkins-bot: calico/kubernetes: Replace istio_cni_token with client cert [puppet] - https://gerrit.wikimedia.org/r/897365 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[20:47:40] (CR) JMeybohm: "PCC: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40079/console" [puppet] - https://gerrit.wikimedia.org/r/897364 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[20:49:45] (JobUnavailable) firing: Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:27:09] (CR) Jforrester: "Let's add a comment that this is temporary with a reference to the task?" [mediawiki-config] - https://gerrit.wikimedia.org/r/896837 (owner: Legoktm)
[21:28:23] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[21:33:57] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[21:34:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:39:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:59:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:04:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:41:40] SRE-swift-storage, Commons, Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (Don-vip) Don't know if it helps but I just got this error with CropTool for my first attempt at creating...
[23:21:43] PROBLEM - Host asw2-b-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[23:23:23] PROBLEM - Host fasw-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[23:23:35] PROBLEM - Host asw2-d-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[23:23:35] PROBLEM - Host asw2-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[23:23:59] PROBLEM - Host asw2-a-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[23:24:11] PROBLEM - Host ps1-f3-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[23:24:15] PROBLEM - Host ps1-e2-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[23:24:15] PROBLEM - Host ps1-f1-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[23:24:43] PROBLEM - Host ps1-f4-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[23:24:53] PROBLEM - Host ps1-e3-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[23:24:53] PROBLEM - Host ps1-e1-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[23:24:53] PROBLEM - Host ps1-f2-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[23:25:01] PROBLEM - Host ps1-e4-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[23:25:29] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:26:13] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:28:15] PROBLEM - Host mr1-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[23:28:59] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[23:29:45] (JobUnavailable) firing: (2) Reduced availability for job benthos in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:34:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:39:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency