[00:00:02] (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1143694 (owner: TrainBranchBot)
[00:01:55] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:09:22] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1143915
[00:09:22] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1143915 (owner: TrainBranchBot)
[00:10:09] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 602.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:15:15] (PS1) DDesouza: miscweb(research-landing-page): bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1143916 (https://phabricator.wikimedia.org/T219903)
[00:15:36] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[00:15:49] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[00:15:51] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[00:16:06] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[00:16:08] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[00:16:26] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[00:18:16] (CR) DDesouza: [C:+2] miscweb(research-landing-page): bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1143916 (https://phabricator.wikimedia.org/T219903) (owner: DDesouza)
[00:19:54] (CR) DDesouza: [V:+2 C:+2] miscweb(research-landing-page): bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1143916 (https://phabricator.wikimedia.org/T219903) (owner: DDesouza)
[00:20:01] (Merged) jenkins-bot: miscweb(research-landing-page): bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/1143916 (https://phabricator.wikimedia.org/T219903) (owner: DDesouza)
[00:22:30] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[00:22:42] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[00:22:43] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[00:22:57] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[00:22:58] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[00:23:17] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[00:27:33] RECOVERY - Disk space on arclamp2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp2001&var-datasource=codfw+prometheus/ops
[00:29:04] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1143915 (owner: TrainBranchBot)
[00:29:43] RECOVERY - Disk space on arclamp1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp1001&var-datasource=eqiad+prometheus/ops
[00:36:15] RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[00:37:55] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:41:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[00:41:43] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[00:41:45] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[00:41:46] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[00:41:48] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[00:41:49] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[00:41:52] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[00:42:07] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[01:07:43] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 248434232 and 14 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:08:43] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:19:25] PROBLEM - Disk space on clouddumps1001 is CRITICAL: DISK CRITICAL - free space: / 16324 MB (3% inode=99%): /tmp 16324 MB (3% inode=99%): /var/tmp 16324 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=clouddumps1001&var-datasource=eqiad+prometheus/ops
[01:59:25] RECOVERY - Disk space on clouddumps1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=clouddumps1001&var-datasource=eqiad+prometheus/ops
[02:09:11] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9443/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9443): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPSConnection object at 0x7ff3dfe12c10: Failed to establish a new connection: [Errno 113] No route to host)) https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:10:11] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-omega-eqiad: cluster_name: production-search-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 34, number_of_data_nodes: 34, discovered_master: True, active_primary_shards: 1708, active_shards: 5121, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.9609603747804 https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:12:09] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.38 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:26:55] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:32:11] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9443/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9443): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPSConnection object at 0x7fc747e24f50: Failed to establish a new connection: [Errno 113] No route to host)) https://wikitech.wikimedia.org/wiki/Search%23Administration
[02:33:11] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-omega-eqiad: cluster_name: production-search-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 34, number_of_data_nodes: 34, active_primary_shards: 1708, active_shards: 5121, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.9609603747804 https://wikitech.wikimedia.org/wiki/Search%23Administration
[03:34:39] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T392968#10809029 (phaultfinder)
[04:01:55] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:37:55] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:41:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[04:42:07] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[05:07:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:02:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:20:23] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:20:35] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:21:13] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:21:25] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53941 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:26:55] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:01:55] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:37:55] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:41:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[08:42:07] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[10:26:55] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:58:03] FIRING: KubernetesAPILatency: High Kubernetes API latency (PATCH events) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-aux&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:03:03] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (PATCH events) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-aux&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:01:55] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:37:55] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:41:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[12:42:07] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:30:11] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9443/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9443): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPSConnection object at 0x7f2690f0bdd0: Failed to establish a new connection: [Errno 113] No route to host)) https://wikitech.wikimedia.org/wiki/Search%23Administration
[13:31:11] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-omega-eqiad: cluster_name: production-search-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 34, number_of_data_nodes: 34, active_primary_shards: 1708, active_shards: 5121, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 2, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 99.9609603747804 https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:26:55] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:41:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:01:55] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:14:07] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[16:17:05] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd
[16:37:55] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:41:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[16:42:07] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[16:44:07] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[16:45:07] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd
[17:14:07] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[17:15:05] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd
[17:47:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 24.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[17:52:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext releases routed via main at eqiad: 24.25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:14:07] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[18:16:07] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[18:17:07] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd
[18:20:03] FIRING: KubernetesAPILatency: High Kubernetes API latency (PUT configmaps) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-aux&var-latency_percentile=0.95&var-verb=PUT - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:25:03] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (PUT configmaps) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-aux&var-latency_percentile=0.95&var-verb=PUT - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:26:55] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:29:13] PROBLEM - OpenSearch health check for shards on 9200 on logstash1025 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:30:07] RECOVERY - OpenSearch health check for shards on 9200 on logstash1025 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 20, number_of_data_nodes: 14, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 760, active_shards: 1783, relocating_shards: 1, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[18:32:55] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:38:19] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:11:25] PROBLEM - Hadoop DataNode on an-worker1098 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[19:21:25] RECOVERY - Hadoop DataNode on an-worker1098 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[19:28:19] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:32:55] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:45:07] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[19:47:05] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd
[19:52:45] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:01:56] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:14:07] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[20:16:05] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd
[20:19:03] FIRING: KubernetesAPILatency: High Kubernetes API latency (PUT leases) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-aux&var-latency_percentile=0.95&var-verb=PUT - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:24:03] RESOLVED: [2x] KubernetesAPILatency: High Kubernetes API latency (PUT configmaps) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-aux&var-latency_percentile=0.95&var-verb=PUT - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:25:48] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on elastic1068:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[20:41:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[20:42:07] FIRING: [2x] SLOMetricAbsent: wdqs-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[21:39:05] PROBLEM - Hadoop NodeManager on an-worker1159 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:43:13] PROBLEM - Hadoop NodeManager on an-worker1199 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:44:11] PROBLEM - Hadoop NodeManager on an-worker1162 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:46:13] RECOVERY - Hadoop NodeManager on an-worker1199 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:48:29] PROBLEM - Hadoop NodeManager on an-worker1175 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:51:05] PROBLEM - Hadoop NodeManager on an-worker1149 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:53:05] RECOVERY - Hadoop NodeManager on an-worker1159 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:03:05] RECOVERY - Hadoop NodeManager on an-worker1149 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:07:29] RECOVERY - Hadoop NodeManager on an-worker1175 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:07:31] PROBLEM - Hadoop NodeManager on an-worker1201 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:08:11] RECOVERY - Hadoop NodeManager on an-worker1162 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:13:07] PROBLEM - Etcd cluster health on aux-k8s-etcd1005 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd
[22:14:05] RECOVERY - Etcd cluster health on aux-k8s-etcd1005 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd
[22:21:31] RECOVERY - Hadoop NodeManager on an-worker1201 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:26:56] FIRING: [3x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:32:55] FIRING: SystemdUnitFailed: mediawiki_job_wikidata-updateQueryServiceLag.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:35:05] PROBLEM - Hadoop NodeManager on an-worker1151 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:38:16] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1143954
[23:38:17] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1143954 (owner: TrainBranchBot)
[23:47:33] PROBLEM - Disk space on arclamp2001 is CRITICAL: DISK CRITICAL - free space: /srv 10722 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp2001&var-datasource=codfw+prometheus/ops
[23:47:41] PROBLEM - Hadoop NodeManager on an-worker1197 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:48:29] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1143954 (owner: TrainBranchBot)
[23:49:43] PROBLEM - Disk space on arclamp1001 is CRITICAL: DISK CRITICAL - free space: /srv 10740 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=arclamp1001&var-datasource=eqiad+prometheus/ops
[23:51:05] RECOVERY - Hadoop NodeManager on an-worker1151 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process