[00:07:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T384592)', diff saved to https://phabricator.wikimedia.org/P73022 and previous config saved to /var/cache/conftool/dbconfig/20250202-000716-marostegui.json
[00:07:19] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[00:22:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P73023 and previous config saved to /var/cache/conftool/dbconfig/20250202-002223-marostegui.json
[00:37:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P73024 and previous config saved to /var/cache/conftool/dbconfig/20250202-003730-marostegui.json
[00:52:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T384592)', diff saved to https://phabricator.wikimedia.org/P73025 and previous config saved to /var/cache/conftool/dbconfig/20250202-005236-marostegui.json
[00:52:40] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[00:52:52] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on db2176.codfw.wmnet with reason: Maintenance
[00:52:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T384592)', diff saved to https://phabricator.wikimedia.org/P73026 and previous config saved to /var/cache/conftool/dbconfig/20250202-005259-marostegui.json
[01:04:32] FIRING: [2x] ProbeDown: Service restbase2034-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:12:07] RESOLVED: [2x] ProbeDown: Service restbase2034-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:46:29] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/79538ca0ecd2200ec8ea336678e0f9fbffc0e8c3cf23b85ec73aa7f9d3573aa0/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:04:39] PROBLEM - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics_product AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[02:06:29] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:06:39] RECOVERY - Checks that the local airflow scheduler for airflow @analytics_product is working properly on an-airflow1006 is OK: OK: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/analytics_product AIRFLOW_HOME=/srv/airflow-analytics_product /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1006.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[02:06:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T384592)', diff saved to https://phabricator.wikimedia.org/P73027 and previous config saved to /var/cache/conftool/dbconfig/20250202-020654-marostegui.json
[02:06:57] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[02:22:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P73028 and previous config saved to /var/cache/conftool/dbconfig/20250202-022201-marostegui.json
[02:31:41] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:37:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P73029 and previous config saved to /var/cache/conftool/dbconfig/20250202-023708-marostegui.json
[02:52:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T384592)', diff saved to https://phabricator.wikimedia.org/P73030 and previous config saved to /var/cache/conftool/dbconfig/20250202-025215-marostegui.json
[02:52:18] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[02:52:30] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on db2188.codfw.wmnet with reason: Maintenance
[02:52:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T384592)', diff saved to https://phabricator.wikimedia.org/P73031 and previous config saved to /var/cache/conftool/dbconfig/20250202-025237-marostegui.json
[03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:08:57] PROBLEM - Host mr1-magru.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[03:13:59] RECOVERY - Host mr1-magru.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 121.87 ms
[03:16:29] FIRING: CertManagerCertNotReady: Certificate default/jayme-debug is not in a ready state (k8s-staging@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-staging&var-namespace=default - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[03:51:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T384592)', diff saved to https://phabricator.wikimedia.org/P73032 and previous config saved to /var/cache/conftool/dbconfig/20250202-035125-marostegui.json
[03:51:28] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[04:06:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P73033 and previous config saved to /var/cache/conftool/dbconfig/20250202-040632-marostegui.json
[04:21:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P73034 and previous config saved to /var/cache/conftool/dbconfig/20250202-042139-marostegui.json
[04:36:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T384592)', diff saved to https://phabricator.wikimedia.org/P73035 and previous config saved to /var/cache/conftool/dbconfig/20250202-043646-marostegui.json
[04:36:49] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[04:37:02] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on db2202.codfw.wmnet with reason: Maintenance
[05:27:35] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on db2203.codfw.wmnet with reason: Maintenance
[05:27:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2203 (T384592)', diff saved to https://phabricator.wikimedia.org/P73036 and previous config saved to /var/cache/conftool/dbconfig/20250202-052741-marostegui.json
[05:27:46] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[06:25:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T384592)', diff saved to https://phabricator.wikimedia.org/P73037 and previous config saved to /var/cache/conftool/dbconfig/20250202-062554-marostegui.json
[06:25:57] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[06:39:39] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Idle - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:41:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203', diff saved to https://phabricator.wikimedia.org/P73038 and previous config saved to /var/cache/conftool/dbconfig/20250202-064101-marostegui.json
[06:47:49] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:56:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203', diff saved to https://phabricator.wikimedia.org/P73039 and previous config saved to /var/cache/conftool/dbconfig/20250202-065608-marostegui.json
[07:11:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T384592)', diff saved to https://phabricator.wikimedia.org/P73040 and previous config saved to /var/cache/conftool/dbconfig/20250202-071115-marostegui.json
[07:11:18] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[07:11:31] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on db2216.codfw.wmnet with reason: Maintenance
[07:11:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T384592)', diff saved to https://phabricator.wikimedia.org/P73041 and previous config saved to /var/cache/conftool/dbconfig/20250202-071137-marostegui.json
[07:16:29] FIRING: CertManagerCertNotReady: Certificate default/jayme-debug is not in a ready state (k8s-staging@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-staging&var-namespace=default - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250202T0800)
[08:00:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:07:35] RECOVERY - Host ripe-atlas-eqiad is UP: PING WARNING - Packet loss = 66%, RTA = 30.68 ms
[08:10:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T384592)', diff saved to https://phabricator.wikimedia.org/P73042 and previous config saved to /var/cache/conftool/dbconfig/20250202-081030-marostegui.json
[08:10:34] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[08:13:59] PROBLEM - Host ripe-atlas-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[08:25:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P73043 and previous config saved to /var/cache/conftool/dbconfig/20250202-082537-marostegui.json
[08:40:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P73044 and previous config saved to /var/cache/conftool/dbconfig/20250202-084044-marostegui.json
[08:55:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T384592)', diff saved to https://phabricator.wikimedia.org/P73045 and previous config saved to /var/cache/conftool/dbconfig/20250202-085551-marostegui.json
[08:55:54] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[09:11:37] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:12:27] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53513 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:16:29] FIRING: CertManagerCertNotReady: Certificate default/jayme-debug is not in a ready state (k8s-staging@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-staging&var-namespace=default - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[11:45:45] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[12:00:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:00:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:37:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:47:36] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on db2212.codfw.wmnet with reason: Maintenance
[15:02:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:16:29] FIRING: CertManagerCertNotReady: Certificate default/jayme-debug is not in a ready state (k8s-staging@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-staging&var-namespace=default - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[15:44:57] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:46:37] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:48:27] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53513 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:48:47] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:51:41] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[19:16:29] FIRING: CertManagerCertNotReady: Certificate default/jayme-debug is not in a ready state (k8s-staging@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-staging&var-namespace=default - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[20:07:17] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1190.eqiad.wmnet with reason: Maintenance
[20:07:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1190 (T384592)', diff saved to https://phabricator.wikimedia.org/P73046 and previous config saved to /var/cache/conftool/dbconfig/20250202-200724-marostegui.json
[20:07:27] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[20:08:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:16:29] FIRING: CertManagerCertNotReady: Certificate default/jayme-debug is not in a ready state (k8s-staging@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-staging&var-namespace=default - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady