[00:03:18] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:08:18] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:09:18] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:11:44] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms
[00:13:33] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:21:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[00:33:10] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[00:44:02] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:44:10] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:52:42] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.35 ms
[00:53:58] (PS1) Dzahn: admin: add Simon Kock to ldap_only admins (nda,wmde) [puppet] - https://gerrit.wikimedia.org/r/820830 (https://phabricator.wikimedia.org/T314563)
[00:57:16] SRE, LDAP-Access-Requests, Patch-For-Review: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (Dzahn) Open→In progress
[01:37:45] (JobUnavailable) firing: (4) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:40] (PS4) Krinkle: multiversion: Fix reason for 'wikipedia' suffix not working [mediawiki-config] - https://gerrit.wikimedia.org/r/816030
[01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:47:45] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
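The ProbeDown and JobUnavailable notifications above all link to the alerts.wikimedia.org query UI; a minimal sketch of the same lookup against the Alertmanager v2 API (the host/port and the jq projection are assumptions, not taken from the log):

    # List currently firing ProbeDown alerts (Alertmanager endpoint is an assumption)
    curl -s 'http://alertmanager.example.org:9093/api/v2/alerts?filter=alertname%3D%22ProbeDown%22' \
      | jq -r '.[] | [.labels.alertname, .labels.instance, .status.state] | @tsv'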
[01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:00:16] (PS5) Krinkle: multiversion: Fix reason for 'wikipedia' suffix not working [mediawiki-config] - https://gerrit.wikimedia.org/r/816030
[02:03:10] (CR) Krinkle: [C: -1] "Nevermind. Thanks to diffConfig job, the true (whether or not intentional) reason for this being the way it is today, is now that by mappi" [mediawiki-config] - https://gerrit.wikimedia.org/r/816030 (owner: Krinkle)
[02:05:34] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: package_builder_Clean_up_build_directory.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:10:01] (PS1) Krinkle: tests: Fix broken testDatabaseSuffixMatchFamily [mediawiki-config] - https://gerrit.wikimedia.org/r/820831
[02:12:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:17:45] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:23:52] (CR) Krinkle: [C: +2] tests: Fix broken testDatabaseSuffixMatchFamily [mediawiki-config] - https://gerrit.wikimedia.org/r/820831 (owner: Krinkle)
[02:24:32] (Merged) jenkins-bot: tests: Fix broken testDatabaseSuffixMatchFamily [mediawiki-config] - https://gerrit.wikimedia.org/r/820831 (owner: Krinkle)
[02:26:06] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:27:03] (PS1) Krinkle: wiki.php: Split 'wikitags' from 'dblists' viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/820832
[02:27:44] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[02:28:21] (CR) CI reject: [V: -1] wiki.php: Split 'wikitags' from 'dblists' viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/820832 (owner: Krinkle)
[02:30:11] (PS2) Krinkle: wiki.php: Split 'wikitags' from 'dblists' viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/820832
[02:31:26] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[02:31:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:32:37] (CR) Krinkle: [C: +2] wiki.php: Split 'wikitags' from 'dblists' viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/820832 (owner: Krinkle)
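The check_systemd_state alerts in this log (build2001 above, logstash1026/2026 elsewhere) all reduce to one unit in failed state; a first-pass triage sketch on the affected host, using only stock systemd tooling (the unit name comes from the alert text):

    systemctl is-system-running   # reports "degraded" while any unit is failed
    systemctl --failed            # list the failed unit(s)
    journalctl -u package_builder_Clean_up_build_directory.service -n 50   # why it failed
    sudo systemctl reset-failed package_builder_Clean_up_build_directory.service   # clear state once the cause is fixed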
[02:32:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:32:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:33:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:34:11] (Merged) jenkins-bot: wiki.php: Split 'wikitags' from 'dblists' viewer [mediawiki-config] - https://gerrit.wikimedia.org/r/820832 (owner: Krinkle)
[02:37:16] RECOVERY - etcd request latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[02:37:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[02:38:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[02:38:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:39:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:39:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:40:22] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:41:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:42:46] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:44:40] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms
[02:49:14] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[02:51:53] (CR) Krinkle: [C: +1] Drop unused wgGECampaignPattern [mediawiki-config] - https://gerrit.wikimedia.org/r/820586 (owner: Gergő Tisza)
[02:53:50] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[02:54:44] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
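The 02:37 downtime entries come from the spicerack cookbook CLI on cumin1001, and the mwdebug entries from helmfile runs on deploy1002; a rough sketch of equivalent invocations (flag spellings and the chart path are assumptions — only the cookbook name, host, and reason are confirmed by the log):

    # Silence a host for one day ahead of maintenance (run on a cumin host)
    sudo cookbook sre.hosts.downtime --days 1 -r "Maintenance" 'db1143.eqiad.wmnet'
    # Re-apply the mwdebug release one datacenter at a time (run on deploy1002)
    cd /srv/deployment-charts/helmfile.d/services/mwdebug
    helmfile -e eqiad apply && helmfile -e codfw apply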
[02:55:02] (CR) Krinkle: [C: -1] "Other than updating from time to time to a current copy from core, I'd recommend we not fork. It's a very stable class with virtually no b" [mediawiki-config] - https://gerrit.wikimedia.org/r/737858 (owner: Thiemo Kreuz (WMDE))
[02:55:25] (PS2) Krinkle: Remove temporary benchmark script [mediawiki-config] - https://gerrit.wikimedia.org/r/819499 (owner: Daniel Kinzler)
[02:55:27] (CR) Krinkle: [C: +2] Remove temporary benchmark script [mediawiki-config] - https://gerrit.wikimedia.org/r/819499 (owner: Daniel Kinzler)
[02:56:38] (Merged) jenkins-bot: Remove temporary benchmark script [mediawiki-config] - https://gerrit.wikimedia.org/r/819499 (owner: Daniel Kinzler)
[02:59:22] (PS2) Krinkle: beta: Remove $wgMediaViewerNetworkPerformanceSamplingFactor [mediawiki-config] - https://gerrit.wikimedia.org/r/820389 (https://phabricator.wikimedia.org/T310890) (owner: Phuedx)
[02:59:25] (CR) Krinkle: [C: +2] beta: Remove $wgMediaViewerNetworkPerformanceSamplingFactor [mediawiki-config] - https://gerrit.wikimedia.org/r/820389 (https://phabricator.wikimedia.org/T310890) (owner: Phuedx)
[03:00:14] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms
[03:01:01] (Merged) jenkins-bot: beta: Remove $wgMediaViewerNetworkPerformanceSamplingFactor [mediawiki-config] - https://gerrit.wikimedia.org/r/820389 (https://phabricator.wikimedia.org/T310890) (owner: Phuedx)
[03:01:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[03:01:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:02:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[03:02:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[03:02:23] !log krinkle@deploy1002 Synchronized w/: I9067d47fab0324 (duration: 03m 25s)
[03:03:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[03:08:18] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:08:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[03:09:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[03:09:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[03:10:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[03:15:06] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation={create,list} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:15:48] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:26:46] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:33:42] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:36:41] (PS1) Andrew Bogott: profile::openstack::eqiad1::rabbitmq_nodes: switch to dedicated nodes [puppet] - https://gerrit.wikimedia.org/r/820833 (https://phabricator.wikimedia.org/T314522)
[03:49:50] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[04:06:34] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[04:14:12] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:21:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[05:06:08] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[05:15:30] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:24:38] Puppet, Infrastructure-Foundations: "operations/puppet" repo inaccessible to Windows developers - https://phabricator.wikimedia.org/T314698 (Novem_Linguae)
[05:29:56] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[06:05:40] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[06:24:42] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[06:38:52] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[06:56:24] (CR) RhinosF1: [C: -1] "needs email adding" [puppet] - https://gerrit.wikimedia.org/r/820830 (https://phabricator.wikimedia.org/T314563) (owner: Dzahn)
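kubemaster2002's etcd latency check keeps re-firing on operation=list; a sketch of inspecting the underlying percentile with a raw PromQL query (the Prometheus endpoint and the exact metric name are assumptions — the operation label comes from the alert text, and the linked Grafana panel is the authoritative view):

    # 99th percentile etcd request latency by operation, roughly what panel 28 plots
    curl -sG 'http://prometheus.example.org/api/v1/query' \
      --data-urlencode 'query=histogram_quantile(0.99, sum by (operation, le) (rate(etcd_request_duration_seconds_bucket{operation="list"}[5m])))'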
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220806T0700)
[07:01:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:05:04] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[07:20:04] PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 278 MB (2% inode=86%): /tmp 278 MB (2% inode=86%): /var/tmp 278 MB (2% inode=86%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops
[08:01:29] RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops
[08:21:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:28:31] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:48:15] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[09:06:19] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[09:10:17] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[09:10:47] (CR) Thiemo Kreuz (WMDE): Remove unused code from StaticSiteConfiguration class (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/737858 (owner: Thiemo Kreuz (WMDE))
[09:34:01] (CR) Krinkle: [C: -1] Remove unused code from StaticSiteConfiguration class (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/737858 (owner: Thiemo Kreuz (WMDE))
[09:38:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:39:18] (ProbeDown) firing: (10) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
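grafana2001's root filesystem dipped to ~2% free at 07:20 (and again at 14:40 below); a standard triage sketch on the host itself (the journal-vacuum step is one common reclaim option, an assumption rather than what was actually done here):

    df -h /                                            # confirm what the check saw
    sudo du -xh / 2>/dev/null | sort -h | tail -n 20   # biggest consumers, this filesystem only (-x)
    sudo journalctl --vacuum-size=200M                 # reclaim space if journals are the culprit (assumption)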
[09:39:19] (ProbeDown) firing: (20) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:39:35] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[09:39:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[09:40:17] (PHPFPMTooBusy) firing: Not enough idle php7.2-fpm.service workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[09:44:13] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[09:44:18] (ProbeDown) resolved: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:44:19] (ProbeDown) resolved: (23) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:44:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable
[09:45:17] (PHPFPMTooBusy) resolved: Not enough idle php7.2-fpm.service workers for Mediawiki appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[09:48:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:06:55] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[10:49:29] PROBLEM - SSH on mw1324.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
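The LogstashKafkaConsumerLag pair above bracketed the brief 09:39 appserver incident: the logging pipeline fell behind and caught up. A sketch of inspecting that lag with the stock Kafka CLI (the broker address and consumer-group name are assumptions; only the logging-eqiad cluster name is confirmed by the alert):

    # Per-partition offsets and lag for the logstash consumer group
    kafka-consumer-groups --bootstrap-server kafka-logging1001.eqiad.wmnet:9092 \
      --describe --group logstash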
[11:01:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[11:05:59] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[11:29:39] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[12:02:49] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[12:21:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:06:53] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[13:14:21] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for virginiapoundstone - https://phabricator.wikimedia.org/T314676 (Aklapper) Note that the Phabricator account @VirginiaPoundstone is linked to [a self-created, non-WMF SUL wiki account](https://www.mediawiki.org/wiki/Special:Log...
[13:17:44] (PS4) Majavah: P:openstack::glance: remove primary_image_store concept [puppet] - https://gerrit.wikimedia.org/r/800949
[13:17:46] (PS4) Majavah: openstack::cinder: monitor the backend port [puppet] - https://gerrit.wikimedia.org/r/800950
[13:17:48] (PS4) Majavah: openstack::nova: monitor the backend port [puppet] - https://gerrit.wikimedia.org/r/800951
[13:17:50] (PS4) Majavah: P:openstack::haproxy: codfw1dev: remove non-tls ports [puppet] - https://gerrit.wikimedia.org/r/800952 (https://phabricator.wikimedia.org/T267194)
[13:17:52] (PS4) Majavah: P:openstack::haproxy: eqiad1: remove non-tls ports [puppet] - https://gerrit.wikimedia.org/r/800953 (https://phabricator.wikimedia.org/T267194)
[13:17:54] (PS4) Majavah: P:openstack::designate::firewall: cleanup [puppet] - https://gerrit.wikimedia.org/r/800954 (https://phabricator.wikimedia.org/T267194)
[13:17:56] (PS4) Majavah: P:openstack: misc cleanup for non-tls ports [puppet] - https://gerrit.wikimedia.org/r/800955 (https://phabricator.wikimedia.org/T267194)
[13:20:26] (PS1) Dreamy Jazz: Pin the reason migration stage to read and write old [mediawiki-config] - https://gerrit.wikimedia.org/r/820838 (https://phabricator.wikimedia.org/T233004)
[13:35:29] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:40:22] Puppet, Infrastructure-Foundations, Wikimedia-Mailing-lists: Unable to clone "operations/puppet" repo successfully on Windows - https://phabricator.wikimedia.org/T314698 (Reedy)
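KubernetesRsyslogDown has been re-firing on kubestage2001 and ml-serve-ctrl2001 all day; beyond the linked runbook, a generic first look on an affected node (the container log path is an assumption, not taken from the log):

    systemctl status rsyslog                         # is the daemon itself up?
    journalctl -u rsyslog --since -1h | tail -n 30   # recent input/forwarding errors
    ls -l /var/log/containers/ | head                # are container logs there for rsyslog to tail? (path assumed)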
[13:50:21] (CR) Andrew Bogott: [C: +2] profile::openstack::eqiad1::rabbitmq_nodes: switch to dedicated nodes [puppet] - https://gerrit.wikimedia.org/r/820833 (https://phabricator.wikimedia.org/T314522) (owner: Andrew Bogott)
[13:53:29] RECOVERY - SSH on mw1324.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:56:51] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for virginiapoundstone - https://phabricator.wikimedia.org/T314676 (Ottomata) Approved.
[14:00:43] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01744 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[14:03:07] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.000969 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[14:36:01] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[14:40:13] PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 297 MB (2% inode=86%): /tmp 297 MB (2% inode=86%): /var/tmp 297 MB (2% inode=86%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops
[14:49:31] (PS1) Andrew Bogott: nova.conf: amqp_durable_queues=true [puppet] - https://gerrit.wikimedia.org/r/820846
[14:54:52] (CR) Andrew Bogott: [C: +2] nova.conf: amqp_durable_queues=true [puppet] - https://gerrit.wikimedia.org/r/820846 (owner: Andrew Bogott)
[15:01:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[15:04:21] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.0155 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[15:25:37] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005329 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[15:26:12] (CR) Ori: prometheus::blackbox::http: add/edit parameter comments (1 comment) [puppet] - https://gerrit.wikimedia.org/r/807176 (owner: Dzahn)
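The Widespread puppet agent failures alert fires when the fleet-wide failure ratio crosses 0.01, as in the 14:00 and 15:04 events above; on any individual failing host (per the linked puppetboard view) the first step is simply a verbose one-off run, sketched here (the WMF wrapper name is an assumption; plain puppet agent works anywhere):

    sudo run-puppet-agent      # WMF wrapper around the agent, if present (assumption)
    sudo puppet agent --test   # generic equivalent: one-off, verbose, non-daemonized run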
[puppet] - 10https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: 10Mary Yang) [15:32:15] (03PS2) 10Dreamy Jazz: Pin the reason migration stage to read and write old [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820838 (https://phabricator.wikimedia.org/T233004) [15:49:29] PROBLEM - Disk space on thanos-be2002 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdj1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2002&var-datasource=codfw+prometheus/ops [15:52:34] (03PS1) 10Andrew Bogott: nova.conf: use ipv4 address for rabbit hosts rather than fqdn [puppet] - 10https://gerrit.wikimedia.org/r/820848 (https://phabricator.wikimedia.org/T314522) [15:55:44] (03CR) 10Andrew Bogott: [C: 03+2] nova.conf: use ipv4 address for rabbit hosts rather than fqdn [puppet] - 10https://gerrit.wikimedia.org/r/820848 (https://phabricator.wikimedia.org/T314522) (owner: 10Andrew Bogott) [15:56:29] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [16:04:25] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:05:59] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [16:06:13] PROBLEM - Check systemd state on thanos-be2002 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:10:03] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:10:53] RECOVERY - Disk space on thanos-be2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2002&var-datasource=codfw+prometheus/ops [16:11:45] (03PS1) 10Andrew Bogott: neutron.conf: use ipv4 address for rabbit hosts rather than fqdn [puppet] - 10https://gerrit.wikimedia.org/r/820849 (https://phabricator.wikimedia.org/T314522) [16:13:26] (03CR) 10Andrew Bogott: [C: 03+2] neutron.conf: use ipv4 address for rabbit hosts rather than fqdn [puppet] - 10https://gerrit.wikimedia.org/r/820849 (https://phabricator.wikimedia.org/T314522) (owner: 10Andrew Bogott) [16:16:50] (03PS9) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) [16:19:48] (03CR) 10CI reject: [V: 04-1] Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [16:21:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:24:48] (03PS1) 10Andrew Bogott: neutron.conf: fix copy/paste error in port number [puppet] - 
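The thanos-be2002 sequence above reads like a failing-disk signature: an I/O error on /srv/swift-storage/sdj1, then the swift-drive-audit unit landing in failed state. A confirmation sketch on the host (the device name comes from the alert; the rest is stock tooling):

    sudo dmesg -T | grep -i sdj | tail -n 20     # kernel-level I/O errors for the device
    sudo smartctl -H /dev/sdj                    # SMART overall health verdict
    systemctl status swift-drive-audit.service   # the unit the systemd check flagged
    grep sdj1 /proc/mounts                       # is the partition still mounted?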
[16:24:48] (PS1) Andrew Bogott: neutron.conf: fix copy/paste error in port number [puppet] - https://gerrit.wikimedia.org/r/820850 (https://phabricator.wikimedia.org/T314522)
[16:27:03] RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops
[16:27:13] (CR) Andrew Bogott: [C: +2] neutron.conf: fix copy/paste error in port number [puppet] - https://gerrit.wikimedia.org/r/820850 (https://phabricator.wikimedia.org/T314522) (owner: Andrew Bogott)
[16:29:43] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[16:36:43] (PS10) Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040)
[16:41:35] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[16:53:27] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[17:03:13] RECOVERY - Check systemd state on thanos-be2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:05:19] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[17:05:45] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:26:37] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[17:38:31] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[17:58:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[17:59:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[17:59:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T312863)', diff saved to https://phabricator.wikimedia.org/P32295 and previous config saved to /var/cache/conftool/dbconfig/20220806-175916-ladsgroup.json
[17:59:20] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[18:06:55] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
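The 17:58-17:59 run above is the standard depool-for-schema-change sequence: downtime the replica, then take it out of rotation with dbctl. A sketch of the dbctl side (subcommand spellings per the conftool dbctl CLI are an assumption; run on a cumin host):

    sudo dbctl instance db1149 depool                          # mark the replica depooled
    sudo dbctl config commit -m 'Depooling db1149 (T312863)'   # generate, diff, and commit the new config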
[18:23:36] (PS11) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[18:25:55] https://commons.wikimedia.org/wiki/File:Keep_tidy_ask.svg is reporting "File not found: /v1/AUTH_mw/wikipedia-commons-local-public.15/1/15/Keep_tidy_ask.svg"
[18:28:36] (CR) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf (4 comments) [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[18:54:23] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[18:57:21] SRE-swift-storage, MediaWiki-File-management, MediaWiki-General, Thumbor: File:Keep_tidy_ask.svg 404 on Commons - https://phabricator.wikimedia.org/T314712 (AntiCompositeNumber)
[19:01:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[19:04:31] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[19:06:15] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[19:36:57] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[19:37:33] (PS1) Andrew Bogott: Mark profile::openstack::base::rabbitmq_service_name invalid on cloudcontrols [puppet] - https://gerrit.wikimedia.org/r/820855
[19:43:21] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 4.34 ms
[19:47:30] (CR) Andrew Bogott: [C: +2] Mark profile::openstack::base::rabbitmq_service_name invalid on cloudcontrols [puppet] - https://gerrit.wikimedia.org/r/820855 (owner: Andrew Bogott)
[19:51:19] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[19:53:25] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:54:39] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:55:52] SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-General, Thumbor: File:Keep_tidy_ask.svg 404 on Commons - https://phabricator.wikimedia.org/T314712 (Legoktm) It had a thumbnail as recently as June 26, 2022: https://web.archive.org/web/20220626092601/https://commons.wikimedia....
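The 404 at 18:25:55 names the exact Swift container and object behind T314712; a sketch of checking whether the original object is really gone, assuming authenticated swift client credentials (ST_AUTH/ST_USER/ST_KEY or equivalent) are already set in the environment:

    # Prints object metadata if present, errors with 404 if the original is missing
    swift stat wikipedia-commons-local-public.15 '1/15/Keep_tidy_ask.svg'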
[19:56:09] (PS1) Majavah: P:toolforge: cleanup bastion grid integration [puppet] - https://gerrit.wikimedia.org/r/820856 (https://phabricator.wikimedia.org/T314665)
[19:56:53] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.241 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:57:54] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36634/console" [puppet] - https://gerrit.wikimedia.org/r/820856 (https://phabricator.wikimedia.org/T314665) (owner: Majavah)
[19:58:01] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48535 bytes in 0.119 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:59:03] (CR) CI reject: [V: -1] P:toolforge: cleanup bastion grid integration [puppet] - https://gerrit.wikimedia.org/r/820856 (https://phabricator.wikimedia.org/T314665) (owner: Majavah)
[20:02:16] (PS2) Majavah: P:toolforge: cleanup bastion grid integration [puppet] - https://gerrit.wikimedia.org/r/820856 (https://phabricator.wikimedia.org/T314665)
[20:04:43] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36635/console" [puppet] - https://gerrit.wikimedia.org/r/820856 (https://phabricator.wikimedia.org/T314665) (owner: Majavah)
[20:05:33] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[20:05:49] (CR) CI reject: [V: -1] P:toolforge: cleanup bastion grid integration [puppet] - https://gerrit.wikimedia.org/r/820856 (https://phabricator.wikimedia.org/T314665) (owner: Majavah)
[20:08:19] (PS3) Majavah: P:toolforge: cleanup bastion grid integration [puppet] - https://gerrit.wikimedia.org/r/820856 (https://phabricator.wikimedia.org/T314665)
[20:11:12] (CR) CI reject: [V: -1] P:toolforge: cleanup bastion grid integration [puppet] - https://gerrit.wikimedia.org/r/820856 (https://phabricator.wikimedia.org/T314665) (owner: Majavah)
[20:12:15] (PS4) Majavah: P:toolforge: cleanup bastion grid integration [puppet] - https://gerrit.wikimedia.org/r/820856 (https://phabricator.wikimedia.org/T314665)
[20:15:09] (CR) CI reject: [V: -1] P:toolforge: cleanup bastion grid integration [puppet] - https://gerrit.wikimedia.org/r/820856 (https://phabricator.wikimedia.org/T314665) (owner: Majavah)
[20:15:27] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:21:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[20:43:37] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[20:50:45] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[21:41:17] SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Kanban): Dedicated cloudrabbit nodes in eqiad1 - https://phabricator.wikimedia.org/T314522 (Andrew) Here is a definite hint: ` 2022-08-06 21:35:40.231 2451963 ERROR oslo_messaging.rpc.server [req-c9f5ee93-a1c6-41a4-82f1-732852e8...
[21:51:19] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[22:00:29] SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Kanban): Dedicated cloudrabbit nodes in eqiad1 - https://phabricator.wikimedia.org/T314522 (Andrew) Phantom config lurking in database! ` mysql:galera_backup@localhost [nova_api_eqiad1]> select * from cell_mappings; +-----------...
[22:05:31] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[22:06:11] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[22:24:23] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:38:27] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[22:57:23] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[22:57:54] SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Kanban): Dedicated cloudrabbit nodes in eqiad1 - https://phabricator.wikimedia.org/T314522 (Andrew) (above passwords have been replaced, but the spirit of the issue remains)
[23:01:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:04:09] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[23:06:51] PROBLEM - etcd request latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[23:13:57] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:30:15] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.27 ms
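Andrew's 22:00:29 find on T314522 explains why the nova.conf/neutron.conf rabbit changes alone didn't stick: nova keeps a per-cell transport_url in the nova_api cell_mappings table, so running cells stay pointed at the old brokers. A sketch of inspecting and repointing that record with stock nova tooling (the UUID and URL are placeholders, not values from the log):

    # Show the cells nova-api knows about, including their stored transport URLs
    nova-manage cell_v2 list_cells --verbose
    # Repoint a cell at the new rabbit cluster (placeholders only)
    nova-manage cell_v2 update_cell --cell_uuid <uuid> \
        --transport-url 'rabbit://user:pass@cloudrabbit1001.eqiad.wmnet:5672/'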