[00:10:44] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 635.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:38:22] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1123731
[00:38:23] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1123731 (owner: TrainBranchBot)
[00:39:38] ops-eqiad, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609 (phaultfinder) NEW
[00:49:13] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1123731 (owner: TrainBranchBot)
[00:57:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[01:05:38] (CR) Dwisehaupt: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5014/co" [puppet] - https://gerrit.wikimedia.org/r/1123711 (https://phabricator.wikimedia.org/T386267) (owner: Dwisehaupt)
[01:09:01] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1123733
[01:09:01] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1123733 (owner: TrainBranchBot)
[01:09:36] ops-eqiad, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10592981 (phaultfinder)
[01:23:48] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:29:20] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1123733 (owner: TrainBranchBot)
[01:37:23] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:37:34] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:38:28] ops-codfw, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T387612 (phaultfinder) NEW
[01:46:28] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/2556af9d6d9eeb43634a2176e4b6ac2d14555933fd0412515fae1ae3682a8781/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:52:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta
[01:53:20] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[01:53:41] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[01:57:45] FIRING: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[01:58:41] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[02:06:28] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:08:44] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:23:20] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[02:34:34] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:37:34] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:51:28] SRE, SRE-Access-Requests: Requesting access to wmf for bwang - https://phabricator.wikimedia.org/T387614 (bwang) NEW
[02:54:15] (PS1) Bernard Wang: Use session storage for session tick events [extensions/WikimediaEvents] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123734 (https://phabricator.wikimedia.org/T387400)
[02:55:27] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/WikimediaEvents] (wmf/1.44.0-wmf.18) - https://gerrit.wikimedia.org/r/1123734 (https://phabricator.wikimedia.org/T387400) (owner: Bernard Wang)
[02:56:13] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 03 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123449 (https://phabricator.wikimedia.org/T387400) (owner: Bernard Wang)
[03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:22:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[03:49:30] (PS7) SD hehua: Set Transwiki namespace on zhwikivoyage and zhwikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/1121622 (https://phabricator.wikimedia.org/T387055)
[04:16:44] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[04:54:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.446s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:57:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[04:59:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.446s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:18:25] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T387616 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[05:18:33] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387616 (ops-monitoring-bot) NEW
[05:19:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10593082 (phaultfinder)
[05:23:48] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:37:02] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:14:27] (CR) Giuseppe Lavagetto: [C:+1] shellbox-media: serve 1/8 of requests on 8.1 with more logging [deployment-charts] - https://gerrit.wikimedia.org/r/1123690 (https://phabricator.wikimedia.org/T377038) (owner: Scott French)
[06:36:15] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.208s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:37:02] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:41:15] RESOLVED: [3x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/canary (k8s) 1.074s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:25:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:32:32] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29345 bytes in 0.243 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:35:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:38:34] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29344 bytes in 2.073 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:46:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:49:34] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29344 bytes in 1.644 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:52:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:54:38] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29344 bytes in 7.246 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:58:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:01:42] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29290 bytes in 9.744 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:04:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:17:40] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29346 bytes in 9.486 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:20:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:32:36] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29344 bytes in 3.662 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:39:54] PROBLEM - BFD status on cr1-magru is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:40:54] RECOVERY - BFD status on cr1-magru is OK: UP: 6 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:46:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:51:32] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29343 bytes in 1.101 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:54:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:55:32] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29353 bytes in 0.721 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:57:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[09:00:15] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 1.252s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:01:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[09:03:36] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29292 bytes in 3.407 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[09:05:15] FIRING: [2x] MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 916.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:06:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[09:07:34] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29354 bytes in 2.389 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[09:10:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid/main (k8s) 966.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:22:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[09:23:48] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[09:32:38] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29344 bytes in 7.543 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[09:36:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[09:39:32] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29350 bytes in 0.229 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[09:42:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[09:43:36] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29344 bytes in 3.746 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[09:47:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[09:52:32] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29345 bytes in 0.942 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[09:54:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3534 MB (3% inode=98%): /tmp 3534 MB (3% inode=98%): /var/tmp 3534 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[09:57:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[09:59:32] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29203 bytes in 0.533 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[10:02:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[10:03:32] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29203 bytes in 0.538 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[10:09:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[10:19:32] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29306 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[10:23:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[10:27:23] FIRING: [3x] ProbeDown: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:29:34] RESOLVED: [3x] ProbeDown: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:29:40] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29344 bytes in 9.269 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[10:34:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[10:47:32] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29292 bytes in 1.216 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[10:53:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[10:56:32] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29344 bytes in 1.215 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[10:59:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[11:01:32] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29344 bytes in 0.241 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[11:05:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[11:10:32] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29345 bytes in 0.303 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[11:13:39] (PS1) Jelto: miscweb: add support for external-services [deployment-charts] - https://gerrit.wikimedia.org/r/1123738 (https://phabricator.wikimedia.org/T350794)
[11:13:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[11:14:32] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29344 bytes in 0.205 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[11:14:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3409 MB (3% inode=98%): /tmp 3409 MB (3% inode=98%): /var/tmp 3409 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[11:17:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[11:19:36] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29345 bytes in 3.911 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[11:27:46] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387616#10593167 (Peachey88) →Duplicate dup:T382984
[11:27:47] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10593169 (Peachey88)
[11:34:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[11:42:32] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29347 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[11:45:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[11:49:32] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29345 bytes in 0.293 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[11:54:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3345 MB (3% inode=98%): /tmp 3345 MB (3% inode=98%): /var/tmp 3345 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[12:07:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[12:09:32] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29342 bytes in 0.217 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[12:13:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[12:15:36] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29343 bytes in 5.609 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[12:20:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[12:22:32] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 28955 bytes in 0.835 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[12:25:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[12:33:34] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29354 bytes in 1.452 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[12:36:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[12:37:36] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29354 bytes in 5.447 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[12:40:12] (PS1) Lucas Werkmeister: Remove $wgAllowAuthenticatedCrossOrigin again [mediawiki-config] - https://gerrit.wikimedia.org/r/1123741 (https://phabricator.wikimedia.org/T322944)
[12:41:02] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - https://gerrit.wikimedia.org/r/1123741 (https://phabricator.wikimedia.org/T322944) (owner: Lucas Werkmeister)
[12:41:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[12:49:34] FIRING: ProbeDown: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox:4008 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:51:34] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29344 bytes in 2.727 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[12:52:23] RESOLVED: ProbeDown: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox:4008 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:55:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[12:57:34] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29316 bytes in 2.365 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[12:57:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:01:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[13:04:36] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29346 bytes in 4.643 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[13:06:50] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:07:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[13:07:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[13:12:45] FIRING: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[13:16:50] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:17:34] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29344 bytes in 2.192 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[13:23:48] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[13:24:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[13:33:34] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29290 bytes in 2.361 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[13:36:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[13:36:50] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:41:36] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29346 bytes in 4.245 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[13:41:50] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:46:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[13:46:50] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:51:32] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29350 bytes in 0.243 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[13:55:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[13:56:50] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:58:36] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29344 bytes in 4.001 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[14:00:26] !log rebooting wikitech-static; the entire server was intermittently locking up
[14:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[14:11:32] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29342 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[14:16:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2070-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[14:18:25] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T387620 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[14:18:33] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387620 (ops-monitoring-bot) NEW
[14:20:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[14:21:39] FIRING: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2064-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[14:22:32] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29345 bytes in 0.290 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[14:25:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[14:26:39] FIRING: [8x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2064-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[14:27:32] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29345 bytes in 0.244 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[14:31:39] RESOLVED: [9x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2064-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[14:35:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:37:34] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29346 bytes in 3.138 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[14:44:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[14:46:49] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[14:55:32] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29342 bytes in 0.219 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[15:00:42] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[15:01:32] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29344 bytes in 1.646 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[15:01:49] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[15:03:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2068-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[15:04:15] (PS1) Andrew Bogott: vendordata: specify certname for initial puppet run [puppet] - https://gerrit.wikimedia.org/r/1123744
[15:05:29] (CR) Andrew Bogott: [C:+2] vendordata: specify certname for initial puppet run [puppet] - https://gerrit.wikimedia.org/r/1123744 (owner: Andrew Bogott)
[15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:11:49] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[15:18:25] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T387621 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[15:18:37] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387621 (ops-monitoring-bot) NEW
[15:21:49] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[15:23:39] FIRING: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2068-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[15:28:39] FIRING: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2068-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[15:33:39] FIRING: [7x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1079-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[15:36:50] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[15:38:39] FIRING: [9x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1071-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[15:43:39] FIRING: [9x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1071-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[15:46:49] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[15:48:39] FIRING: [10x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1071-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[15:51:49] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[15:53:39] FIRING: [11x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1071-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[15:56:50] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[15:58:39] RESOLVED: [12x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1071-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[16:17:40] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[16:17:45] (CR) Andrew Bogott: "This issue finally reappeared and I retested with altering work_mem; that seems to work! Amending patch..." [puppet] - https://gerrit.wikimedia.org/r/1105020 (https://phabricator.wikimedia.org/T381548) (owner: Andrew Bogott)
[16:17:58] (PS5) Andrew Bogott: cloudbackup: work around a postgresql bug by adjusting work_mem [puppet] - https://gerrit.wikimedia.org/r/1105020 (https://phabricator.wikimedia.org/T381548)
[16:20:35] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1105020 (https://phabricator.wikimedia.org/T381548) (owner: Andrew Bogott)
[16:24:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10593261 (phaultfinder)
[16:26:39] FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2061-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[16:27:40] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[16:31:39] RESOLVED: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2061-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[16:42:40] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[16:56:39] FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2062-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[16:57:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:57:40] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[17:01:39] FIRING: [8x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2059-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[17:06:39] FIRING: [8x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2059-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[17:07:21] (PS1) Andrew Bogott: wmcs-image-create: move the disable_puppet step into nova metadata [puppet] - https://gerrit.wikimedia.org/r/1123760
[17:07:40] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[17:07:46] (CR) CI reject: [V:-1] wmcs-image-create: move the disable_puppet step into nova metadata [puppet] - https://gerrit.wikimedia.org/r/1123760 (owner: Andrew Bogott)
[17:09:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T387609#10593288 (phaultfinder)
[17:10:29] (PS2) Andrew Bogott: wmcs-image-create: move the disable_puppet step into nova metadata [puppet] - https://gerrit.wikimedia.org/r/1123760
[17:11:19] (PS3) Andrew Bogott: wmcs-image-create: move the disable_puppet step into nova metadata [puppet] - https://gerrit.wikimedia.org/r/1123760
[17:11:39] FIRING: [9x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2055-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[17:11:55] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1123760 (owner: Andrew Bogott)
[17:12:45] FIRING: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[17:16:39] FIRING: [10x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2055-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[17:21:39] FIRING: [11x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2055-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[17:22:52] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[17:22:57] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[17:23:48] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[17:26:39] FIRING: [7x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2055-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[17:27:40] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[17:27:48] (CR) Andrew Bogott: [C:+2] wmcs-image-create: move the disable_puppet step into nova metadata [puppet] - https://gerrit.wikimedia.org/r/1123760 (owner: Andrew Bogott)
[17:36:39] FIRING: [8x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2055-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[17:37:40] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[17:37:44] (PS1) Andrew Bogott: vendordata: on initial setup, don't remove /var/log subdirs [puppet] - https://gerrit.wikimedia.org/r/1123762
[17:38:22] (CR) Andrew Bogott: [C:+2] vendordata: on initial setup, don't remove /var/log subdirs [puppet] - https://gerrit.wikimedia.org/r/1123762 (owner: Andrew Bogott)
[17:38:49] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[17:41:39] RESOLVED: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic2055-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[17:47:37] !log bounce mtail on centrallog2002
[17:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:49] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[17:57:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1083-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[18:01:54] FIRING: [2x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1083-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[18:05:40] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[18:05:52] ^^ not sure what's going on with those indexing alerts. they seem to be false positives
[18:06:54] FIRING: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1079-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[18:07:39] FIRING: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1079-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[18:11:41] !bking@elastic1083 systemctl restart elasticsearch_7@production-search-eqiad.service "hopefully resolve indexing not increasing error"
[18:11:54] FIRING: [5x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1076-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[18:15:41] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[18:16:54] FIRING: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1066-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[18:18:51] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[18:19:18] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for chuckonwumelu - https://phabricator.wikimedia.org/T387627 (Chuckonwumelu) NEW
[18:21:54] FIRING: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1066-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[18:23:51] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[18:26:54] FIRING: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1066-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[18:27:39] FIRING: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1066-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[18:29:02] !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[18:29:17] !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:30:49] !log disabling the saneitizer on the cirrus streaming updater in codfw (hotfix for T387625)
[18:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:51] T387625: SUP consumer stuck in codfw - https://phabricator.wikimedia.org/T387625
[18:35:52] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[18:36:44] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:36:54] RESOLVED: [6x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1066-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[18:37:04] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[18:37:13] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[18:37:45] FIRING: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[18:37:52] !log disabling the saneitizer on the cirrus streaming updater for consumer-search@eqiad & consumer-cloudelastic (pre-emptive hotfix for T387625)
[18:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:55] T387625: SUP consumer stuck in codfw - https://phabricator.wikimedia.org/T387625
[18:39:20] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[18:44:20] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[18:47:45] RESOLVED: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
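The saneitizer hotfix above went out per-datacenter through helmfile. A rough sketch of that workflow, assuming the usual deployment-charts checkout path on the deploy host (the path is an assumption, and the saneitizer toggle itself is a values change in the service's config, not shown here):

    # On the deploy host, from the service's helmfile directory:
    cd /srv/deployment-charts/helmfile.d/services/cirrus-streaming-updater
    # Review the rendered change, then apply one datacenter at a time,
    # checking consumer stability in between:
    helmfile -e codfw diff
    helmfile -e codfw apply
    helmfile -e eqiad apply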
[18:48:25] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T387628 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[18:48:36] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387628 (10ops-monitoring-bot) 03NEW
[19:28:50] PROBLEM - ganeti-noded running on ganeti2020 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[19:29:50] RECOVERY - ganeti-noded running on ganeti2020 is OK: PROCS OK: 2 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[19:40:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[19:40:26] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[19:42:27] (03PS4) 10Bartosz Dziewoński: Deduplicate JsonConfig config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122711
[19:42:27] (03PS1) 10Bartosz Dziewoński: Fix inconsistent definitions for $wmgLocalServices['chart-renderer'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123766
[19:44:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 16.07% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:44:46] (03CR) 10Bartosz Dziewoński: Deduplicate JsonConfig config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1122711 (owner: 10Bartosz Dziewoński)
[19:45:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123766 (owner: 10Bartosz Dziewoński)
[19:45:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[19:49:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:54:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:04:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 8.929% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:09:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 17.86% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:14:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 19.64% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:19:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 17.86% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:29:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 19.64% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
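PHPFPMTooBusy fires on the fraction of idle PHP-FPM workers, which is what the percentages above report. A quick way to eyeball that ratio against a Prometheus API, sketched with assumed endpoint, metric, and label names (none are taken from the actual alert definition):

    # Hypothetical endpoint and phpfpm_* metric names; substitute the real ones.
    curl -sG 'http://prometheus.example.wmnet/api/v1/query' \
      --data-urlencode 'query=sum(phpfpm_idle_processes{namespace="mw-api-ext"}) / sum(phpfpm_total_processes{namespace="mw-api-ext"})'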
[20:30:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=canary - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:34:30] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext/canary at eqiad: 21.43% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:40:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[20:40:26] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[20:40:43] RESOLVED: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@eqiad to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh
[20:57:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on aux-k8s-ctrl2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[21:05:26] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:23:48] FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate eventgate-analytics-external.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[21:43:50] PROBLEM - ganeti-noded running on ganeti2020 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[21:44:50] RECOVERY - ganeti-noded running on ganeti2020 is OK: PROCS OK: 2 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[21:48:25] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T387632 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[21:48:30] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387632 (10ops-monitoring-bot) 03NEW
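The ganeti-noded flaps above are a plain process-count check: two root-owned ganeti-noded processes are expected, and a third trips CRITICAL. A manual equivalent on the host, with thresholds mirroring the alert text rather than the real check definition:

    # Count root-owned ganeti-noded processes (the alert expects exactly 2):
    pgrep -u root -c ganeti-noded
    # Nagios-style equivalent; check_procs ships with Debian's monitoring-plugins:
    /usr/lib/nagios/plugins/check_procs -u root -C ganeti-noded -c 2:2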
[22:48:25] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T387646 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[22:48:36] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387646 (10ops-monitoring-bot) 03NEW
[23:34:35] FIRING: [2x] ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[23:35:37] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10593699 (10Peachey88)
[23:35:38] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387646#10593705 (10Peachey88) →14Duplicate dup:03T382984
[23:35:39] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387632#10593706 (10Peachey88) →14Duplicate dup:03T382984
[23:35:40] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387628#10593707 (10Peachey88) →14Duplicate dup:03T382984
[23:35:42] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387621#10593708 (10Peachey88) →14Duplicate dup:03T382984
[23:35:43] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T387620#10593709 (10Peachey88) →14Duplicate dup:03T382984
[23:38:11] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10593710 (10Peachey88)
[23:52:23] RESOLVED: [2x] ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[23:59:10] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
[23:59:15] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
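All of the an-presto1014 acknowledgements in this window were auto-filed for the same degraded PERC controller and later merged into T382984. Hands-on triage per the linked PERCCli page typically starts from the controller summary; a sketch, assuming a single controller at /c0 and the perccli64 binary name common on Dell hosts:

    # Controller, virtual-drive and enclosure overview:
    sudo perccli64 /c0 show
    # Per-physical-drive state for the degraded array:
    sudo perccli64 /c0/eall/sall show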