[00:06:46] (CR) BCornwall: [V:+1 C:-1] "Close! Just a test issue then we're out of the gate. The suggested change passes tests using a test CR at I5a3616171dd2696de1115de203b2665" [puppet] - https://gerrit.wikimedia.org/r/1154085 (owner: CDobbins)
[00:08:45] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1172727
[00:08:45] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1172727 (owner: TrainBranchBot)
[00:11:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:19:41] (PS3) Anzx: throttle: add rules for Wikimania 2025 [mediawiki-config] - https://gerrit.wikimedia.org/r/1172659 (https://phabricator.wikimedia.org/T400276)
[01:16:16] (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1172727 (owner: TrainBranchBot)
[01:50:30] PROBLEM - Disk space on an-worker1128 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 154841 MB (4% inode=99%): /var/lib/hadoop/data/d 156215 MB (4% inode=99%): /var/lib/hadoop/data/j 162614 MB (4% inode=99%): /var/lib/hadoop/data/f 157322 MB (4% inode=99%): /var/lib/hadoop/data/g 159617 MB (4% inode=99%): /var/lib/hadoop/data/i 159987 MB (4% inode=99%): /var/lib/hadoop/data/b 149852 MB (3% inode=99%): /var/lib/hadoop/data
[01:50:30] 2 MB (4% inode=99%): /var/lib/hadoop/data/e 158270 MB (4% inode=99%): /var/lib/hadoop/data/h 156563 MB (4% inode=99%): /var/lib/hadoop/data/k 159195 MB (4% inode=99%): /var/lib/hadoop/data/m 157625 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1128&var-datasource=eqiad+prometheus/ops
[02:02:44] PROBLEM - Disk space on an-worker1133 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/f 159117 MB (4% inode=99%): /var/lib/hadoop/data/l 154759 MB (4% inode=99%): /var/lib/hadoop/data/c 154957 MB (4% inode=99%): /var/lib/hadoop/data/d 152915 MB (4% inode=99%): /var/lib/hadoop/data/j 155181 MB (4% inode=99%): /var/lib/hadoop/data/e 159445 MB (4% inode=99%): /var/lib/hadoop/data/h 154187 MB (4% inode=99%): /var/lib/hadoop/data
[02:02:44] 7 MB (4% inode=99%): /var/lib/hadoop/data/m 154757 MB (4% inode=99%): /var/lib/hadoop/data/b 156658 MB (4% inode=99%): /var/lib/hadoop/data/g 148462 MB (3% inode=99%): /var/lib/hadoop/data/k 151928 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1133&var-datasource=eqiad+prometheus/ops
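The Icinga disk-space lines above (and their repeats later in the log) read like standard check_disk plugin output: one "<mount> <free MB> (<free %> inode=<free inode %>)" segment per Hadoop DataNode partition, firing once any partition falls below the critical free-space threshold. A minimal sketch of such a check is below; the threshold values and mount list are illustrative assumptions, not values read from this log.

  # Hypothetical check_disk invocation for a Hadoop worker; thresholds are assumed, not taken from this log.
  # -w/-c are minimum free-space percentages; -p restricts the check to the listed mount points.
  /usr/lib/nagios/plugins/check_disk -w 10% -c 5% \
      -p /var/lib/hadoop/data/b -p /var/lib/hadoop/data/c -p /var/lib/hadoop/data/d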
[02:11:48] FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:21:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:54:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:09:27] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:35:25] (CR) Dzahn: [C:+1] "let's ship it. comments at https://phabricator.wikimedia.org/T400367#11036587" [puppet] - https://gerrit.wikimedia.org/r/1172391 (https://phabricator.wikimedia.org/T400367) (owner: BCornwall)
[04:11:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:09:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:11:48] FIRING: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:19:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:21:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:50:30] PROBLEM - Disk space on an-worker1128 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 148396 MB (3% inode=99%): /var/lib/hadoop/data/d 158106 MB (4% inode=99%): /var/lib/hadoop/data/j 152323 MB (4% inode=99%): /var/lib/hadoop/data/f 154799 MB (4% inode=99%): /var/lib/hadoop/data/g 151678 MB (4% inode=99%): /var/lib/hadoop/data/i 148339 MB (3% inode=99%): /var/lib/hadoop/data/b 151221 MB (4% inode=99%): /var/lib/hadoop/data
[05:50:30] 3 MB (4% inode=99%): /var/lib/hadoop/data/e 152864 MB (4% inode=99%): /var/lib/hadoop/data/h 158621 MB (4% inode=99%): /var/lib/hadoop/data/k 152336 MB (4% inode=99%): /var/lib/hadoop/data/m 155167 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1128&var-datasource=eqiad+prometheus/ops
[06:06:14] PROBLEM - Disk space on an-worker1129 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/b 164068 MB (4% inode=99%): /var/lib/hadoop/data/l 150151 MB (3% inode=99%): /var/lib/hadoop/data/k 162067 MB (4% inode=99%): /var/lib/hadoop/data/c 163676 MB (4% inode=99%): /var/lib/hadoop/data/d 164693 MB (4% inode=99%): /var/lib/hadoop/data/e 161868 MB (4% inode=99%): /var/lib/hadoop/data/g 159074 MB (4% inode=99%): /var/lib/hadoop/data
[06:06:14] 9 MB (4% inode=99%): /var/lib/hadoop/data/i 156888 MB (4% inode=99%): /var/lib/hadoop/data/j 162028 MB (4% inode=99%): /var/lib/hadoop/data/m 164857 MB (4% inode=99%): /var/lib/hadoop/data/f 164363 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1129&var-datasource=eqiad+prometheus/ops
[06:54:14] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:09:27] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[08:10:30] PROBLEM - Disk space on an-worker1128 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 148933 MB (3% inode=99%): /var/lib/hadoop/data/d 157609 MB (4% inode=99%): /var/lib/hadoop/data/j 149350 MB (3% inode=99%): /var/lib/hadoop/data/f 156932 MB (4% inode=99%): /var/lib/hadoop/data/g 152992 MB (4% inode=99%): /var/lib/hadoop/data/i 150192 MB (4% inode=99%): /var/lib/hadoop/data/b 146057 MB (3% inode=99%): /var/lib/hadoop/data
[08:10:30] 9 MB (4% inode=99%): /var/lib/hadoop/data/e 157748 MB (4% inode=99%): /var/lib/hadoop/data/h 155297 MB (4% inode=99%): /var/lib/hadoop/data/k 152799 MB (4% inode=99%): /var/lib/hadoop/data/m 154581 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1128&var-datasource=eqiad+prometheus/ops
[08:11:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:13:20] PROBLEM - Disk space on an-worker1118 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 161575 MB (4% inode=99%): /var/lib/hadoop/data/e 160541 MB (4% inode=99%): /var/lib/hadoop/data/m 157824 MB (4% inode=99%): /var/lib/hadoop/data/k 157522 MB (4% inode=99%): /var/lib/hadoop/data/f 159147 MB (4% inode=99%): /var/lib/hadoop/data/g 158829 MB (4% inode=99%): /var/lib/hadoop/data/h 160889 MB (4% inode=99%): /var/lib/hadoop/data
[08:13:20] 6 MB (4% inode=99%): /var/lib/hadoop/data/j 149671 MB (3% inode=99%): /var/lib/hadoop/data/c 159030 MB (4% inode=99%): /var/lib/hadoop/data/l 157829 MB (4% inode=99%): /var/lib/hadoop/data/b 161712 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1118&var-datasource=eqiad+prometheus/ops
[09:17:52] (PS3) Federico Ceratto: zarcillo: Add egress to dyna.w.o [deployment-charts] - https://gerrit.wikimedia.org/r/1172635 (https://phabricator.wikimedia.org/T384810)
[09:24:26] (CR) Federico Ceratto: "I updated the ipaddrs as discussed and replied to questions" [deployment-charts] - https://gerrit.wikimedia.org/r/1172635 (https://phabricator.wikimedia.org/T384810) (owner: Federico Ceratto)
[09:38:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P79998 and previous config saved to /var/cache/conftool/dbconfig/20250726-093810-root.json
[09:38:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P79999 and previous config saved to /var/cache/conftool/dbconfig/20250726-093815-root.json
[09:53:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2037 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P80000 and previous config saved to /var/cache/conftool/dbconfig/20250726-095315-root.json
[09:53:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2038 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P80001 and previous config saved to /var/cache/conftool/dbconfig/20250726-095321-root.json
[10:08:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2037 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P80002 and previous config saved to /var/cache/conftool/dbconfig/20250726-100821-root.json
[10:08:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2038 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P80003 and previous config saved to /var/cache/conftool/dbconfig/20250726-100827-root.json
[10:10:04] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for SD0001 - https://phabricator.wikimedia.org/T400405#11036758 (SD0001) >>! In T400405#11035891, @jhathaway wrote: > @SD0001 would you kindly post a gerrit patch with your ssh public key, as a way to verify it, outside of this ti...
[10:23:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P80004 and previous config saved to /var/cache/conftool/dbconfig/20250726-102327-root.json
[10:23:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P80005 and previous config saved to /var/cache/conftool/dbconfig/20250726-102333-root.json
[10:38:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P80006 and previous config saved to /var/cache/conftool/dbconfig/20250726-103833-root.json
[10:38:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P80007 and previous config saved to /var/cache/conftool/dbconfig/20250726-103838-root.json
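The !log dbctl entries above are a staged repool of es2037 and es2038: each host is brought back at 10%, 25%, 50%, 75% and then 100% of its normal pooled weight, with roughly 15 minutes between steps and a config commit after every change. A hand-rolled sketch of the same ramp for one host follows, assuming dbctl's documented "instance ... pool -p <percent>" and "config commit -m <msg>" subcommands; in practice this is normally driven by the repool tooling rather than typed by hand, and the host name, percentages and pause length here are illustrative only.

  # Sketch of a staged repool mirroring the ramp logged above (assumed dbctl subcommands).
  host=es2037
  for pct in 10 25 50 75 100; do
      dbctl instance "$host" pool -p "$pct"
      dbctl config commit -m "${host} (re)pooling @ ${pct}%: Repooling"
      sleep 900   # ~15 minutes between steps, matching the timestamps in the log
  done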
[10:54:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[11:09:27] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[11:12:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:17:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:27:59] SRE: traffic from Discord and Slack unfurler service is blocked by phabricator.wikimedia.org - https://phabricator.wikimedia.org/T400540 (Novem_Linguae) NEW
[11:30:30] PROBLEM - Disk space on an-worker1128 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 158243 MB (4% inode=99%): /var/lib/hadoop/data/d 161406 MB (4% inode=99%): /var/lib/hadoop/data/j 153401 MB (4% inode=99%): /var/lib/hadoop/data/f 159911 MB (4% inode=99%): /var/lib/hadoop/data/g 161719 MB (4% inode=99%): /var/lib/hadoop/data/i 153171 MB (4% inode=99%): /var/lib/hadoop/data/b 150045 MB (3% inode=99%): /var/lib/hadoop/data
[11:30:30] 7 MB (4% inode=99%): /var/lib/hadoop/data/e 158414 MB (4% inode=99%): /var/lib/hadoop/data/h 158678 MB (4% inode=99%): /var/lib/hadoop/data/k 156132 MB (4% inode=99%): /var/lib/hadoop/data/m 160143 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1128&var-datasource=eqiad+prometheus/ops
[11:32:28] SRE: traffic from Discord and Slack unfurler service is blocked by phabricator.wikimedia.org - https://phabricator.wikimedia.org/T400540#11036840 (Novem_Linguae)
[12:11:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:28:04] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:28:54] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:08:24] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:16:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:21:27] FIRING: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:26:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:41:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:48:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:53:20] PROBLEM - Disk space on an-worker1118 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 153141 MB (4% inode=99%): /var/lib/hadoop/data/e 158920 MB (4% inode=99%): /var/lib/hadoop/data/m 159203 MB (4% inode=99%): /var/lib/hadoop/data/k 158838 MB (4% inode=99%): /var/lib/hadoop/data/f 155865 MB (4% inode=99%): /var/lib/hadoop/data/g 158749 MB (4% inode=99%): /var/lib/hadoop/data/h 158906 MB (4% inode=99%): /var/lib/hadoop/data
[13:53:20] 9 MB (4% inode=99%): /var/lib/hadoop/data/j 148619 MB (3% inode=99%): /var/lib/hadoop/data/c 158944 MB (4% inode=99%): /var/lib/hadoop/data/l 157490 MB (4% inode=99%): /var/lib/hadoop/data/b 155859 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1118&var-datasource=eqiad+prometheus/ops
[13:58:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:03:27] FIRING: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:08:24] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:08:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:10:30] PROBLEM - Disk space on an-worker1128 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 156769 MB (4% inode=99%): /var/lib/hadoop/data/d 157679 MB (4% inode=99%): /var/lib/hadoop/data/j 147827 MB (3% inode=99%): /var/lib/hadoop/data/f 158397 MB (4% inode=99%): /var/lib/hadoop/data/g 153999 MB (4% inode=99%): /var/lib/hadoop/data/i 156362 MB (4% inode=99%): /var/lib/hadoop/data/b 150740 MB (4% inode=99%): /var/lib/hadoop/data
[14:10:30] 7 MB (4% inode=99%): /var/lib/hadoop/data/e 152900 MB (4% inode=99%): /var/lib/hadoop/data/h 157771 MB (4% inode=99%): /var/lib/hadoop/data/k 149106 MB (3% inode=99%): /var/lib/hadoop/data/m 157999 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1128&var-datasource=eqiad+prometheus/ops
[14:11:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:11:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[14:13:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:21:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[14:33:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:33:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:53:20] PROBLEM - Disk space on an-worker1118 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 148520 MB (3% inode=99%): /var/lib/hadoop/data/e 154777 MB (4% inode=99%): /var/lib/hadoop/data/m 155393 MB (4% inode=99%): /var/lib/hadoop/data/k 157042 MB (4% inode=99%): /var/lib/hadoop/data/f 152807 MB (4% inode=99%): /var/lib/hadoop/data/g 158380 MB (4% inode=99%): /var/lib/hadoop/data/h 156748 MB (4% inode=99%): /var/lib/hadoop/data
[14:53:20] 6 MB (4% inode=99%): /var/lib/hadoop/data/j 151243 MB (4% inode=99%): /var/lib/hadoop/data/c 158573 MB (4% inode=99%): /var/lib/hadoop/data/l 155960 MB (4% inode=99%): /var/lib/hadoop/data/b 153560 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1118&var-datasource=eqiad+prometheus/ops
[14:54:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[15:06:14] PROBLEM - Disk space on an-worker1129 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/b 153611 MB (4% inode=99%): /var/lib/hadoop/data/l 148236 MB (3% inode=99%): /var/lib/hadoop/data/k 153036 MB (4% inode=99%): /var/lib/hadoop/data/c 158215 MB (4% inode=99%): /var/lib/hadoop/data/d 157181 MB (4% inode=99%): /var/lib/hadoop/data/e 152740 MB (4% inode=99%): /var/lib/hadoop/data/g 154377 MB (4% inode=99%): /var/lib/hadoop/data
[15:06:14] 0 MB (4% inode=99%): /var/lib/hadoop/data/i 155393 MB (4% inode=99%): /var/lib/hadoop/data/j 151132 MB (4% inode=99%): /var/lib/hadoop/data/m 157413 MB (4% inode=99%): /var/lib/hadoop/data/f 154319 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1129&var-datasource=eqiad+prometheus/ops
[15:09:27] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[15:09:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:19:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:42:44] PROBLEM - Disk space on an-worker1133 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/f 156963 MB (4% inode=99%): /var/lib/hadoop/data/l 154791 MB (4% inode=99%): /var/lib/hadoop/data/c 157137 MB (4% inode=99%): /var/lib/hadoop/data/d 152326 MB (4% inode=99%): /var/lib/hadoop/data/j 157582 MB (4% inode=99%): /var/lib/hadoop/data/e 157618 MB (4% inode=99%): /var/lib/hadoop/data/h 147639 MB (3% inode=99%): /var/lib/hadoop/data
[15:42:44] 2 MB (3% inode=99%): /var/lib/hadoop/data/m 155431 MB (4% inode=99%): /var/lib/hadoop/data/b 151634 MB (4% inode=99%): /var/lib/hadoop/data/g 157092 MB (4% inode=99%): /var/lib/hadoop/data/k 157745 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1133&var-datasource=eqiad+prometheus/ops
[15:46:14] PROBLEM - Disk space on an-worker1129 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/b 152619 MB (4% inode=99%): /var/lib/hadoop/data/l 148975 MB (3% inode=99%): /var/lib/hadoop/data/k 151544 MB (4% inode=99%): /var/lib/hadoop/data/c 157420 MB (4% inode=99%): /var/lib/hadoop/data/d 156187 MB (4% inode=99%): /var/lib/hadoop/data/e 151800 MB (4% inode=99%): /var/lib/hadoop/data/g 152401 MB (4% inode=99%): /var/lib/hadoop/data
[15:46:14] 0 MB (4% inode=99%): /var/lib/hadoop/data/i 153033 MB (4% inode=99%): /var/lib/hadoop/data/j 151067 MB (4% inode=99%): /var/lib/hadoop/data/m 155917 MB (4% inode=99%): /var/lib/hadoop/data/f 156070 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1129&var-datasource=eqiad+prometheus/ops
[18:11:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-wikikube_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:54:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[19:09:27] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[20:25:48] FIRING: ThumborHighHaproxyErrorRate: Thumbor haproxy error rate for pod thumbor-main-5575cbcfcf-jzxlh - eqiad - https://wikitech.wikimedia.org/wiki/Thumbor - https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor - https://alerts.wikimedia.org/?q=alertname%3DThumborHighHaproxyErrorRate