[00:08:31] (PS1) Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1115983
[00:08:35] (PS1) Ncmonitor: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1115984
[00:08:39] (PS1) Ncmonitor: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1115985
[00:10:25] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 613.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:31:31] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:38:16] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1115988
[00:38:16] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1115988 (owner: TrainBranchBot)
[00:50:25] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1115988 (owner: TrainBranchBot)
[00:52:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T384592)', diff saved to https://phabricator.wikimedia.org/P72971 and previous config saved to /var/cache/conftool/dbconfig/20250201-005205-marostegui.json
[00:52:09] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[00:53:46] (CR) BCornwall: [C:+1] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1115985 (owner: Ncmonitor)
[00:54:42] (CR) Ssingh: [C:+1] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1115985 (owner: Ncmonitor)
[00:54:47] (CR)
BCornwall: ACMEChiefConfig: Automated MarkMonitor domain sync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1092931 (owner: Ncmonitor)
[00:57:26] (CR) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1107566 (owner: Ncmonitor)
[00:59:17] (Abandoned) BCornwall: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1099314 (owner: Ncmonitor)
[01:01:29] (Abandoned) BCornwall: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1107567 (owner: Ncmonitor)
[01:07:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P72973 and previous config saved to /var/cache/conftool/dbconfig/20250201-010712-marostegui.json
[01:08:50] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1115995
[01:08:50] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1115995 (owner: TrainBranchBot)
[01:09:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10514346 (phaultfinder)
[01:10:41] (PS1) BCornwall: ncmonitor: Ignore wikipediacreators.com [puppet] - https://gerrit.wikimedia.org/r/1115996
[01:12:26] (CR) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1107566 (owner: Ncmonitor)
[01:13:38] (Abandoned) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1107566 (owner: Ncmonitor)
[01:22:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P72974 and previous config saved to
/var/cache/conftool/dbconfig/20250201-012219-marostegui.json
[01:23:55] (PS2) BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1115984 (owner: Ncmonitor)
[01:25:39] !log import ncmonitor 1.3.1 into bookworm-wikimedia
[01:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:28:08] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1115995 (owner: TrainBranchBot)
[01:37:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T384592)', diff saved to https://phabricator.wikimedia.org/P72975 and previous config saved to /var/cache/conftool/dbconfig/20250201-013726-marostegui.json
[01:37:29] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[01:37:37] RECOVERY - Host ripe-atlas-eqsin is UP: PING WARNING - Packet loss = 71%, RTA = 0.34 ms
[01:37:42] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on db1219.eqiad.wmnet with reason: Maintenance
[01:37:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T384592)', diff saved to https://phabricator.wikimedia.org/P72976 and previous config saved to /var/cache/conftool/dbconfig/20250201-013748-marostegui.json
[01:40:57] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 9266 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[01:44:01] PROBLEM - Host ripe-atlas-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
[02:12:25] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:37:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:47:51] (CR) Pppery: NCRedirRedirects: Automated MarkMonitor domain sync (7 comments) [puppet] - https://gerrit.wikimedia.org/r/1115984 (owner: Ncmonitor)
[02:48:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T384592)', diff saved to https://phabricator.wikimedia.org/P72977 and previous config saved to /var/cache/conftool/dbconfig/20250201-024829-marostegui.json
[02:48:33] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[03:02:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:03:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P72978 and previous config saved to /var/cache/conftool/dbconfig/20250201-030337-marostegui.json
[03:16:28] FIRING: CertManagerCertNotReady: Certificate default/jayme-debug is not in a ready state (k8s-staging@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-staging&var-namespace=default - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[03:18:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P72979 and previous config saved to
/var/cache/conftool/dbconfig/20250201-031843-marostegui.json
[03:33:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T384592)', diff saved to https://phabricator.wikimedia.org/P72980 and previous config saved to /var/cache/conftool/dbconfig/20250201-033350-marostegui.json
[03:33:53] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[03:34:06] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on db1232.eqiad.wmnet with reason: Maintenance
[03:34:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T384592)', diff saved to https://phabricator.wikimedia.org/P72981 and previous config saved to /var/cache/conftool/dbconfig/20250201-033412-marostegui.json
[04:44:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T384592)', diff saved to https://phabricator.wikimedia.org/P72982 and previous config saved to /var/cache/conftool/dbconfig/20250201-044444-marostegui.json
[04:44:47] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[04:59:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P72983 and previous config saved to /var/cache/conftool/dbconfig/20250201-045951-marostegui.json
[05:14:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P72984 and previous config saved to /var/cache/conftool/dbconfig/20250201-051458-marostegui.json
[05:21:33] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:30:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T384592)', diff saved to https://phabricator.wikimedia.org/P72985 and
previous config saved to /var/cache/conftool/dbconfig/20250201-053005-marostegui.json
[05:30:08] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[05:30:21] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on db1234.eqiad.wmnet with reason: Maintenance
[05:30:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T384592)', diff saved to https://phabricator.wikimedia.org/P72986 and previous config saved to /var/cache/conftool/dbconfig/20250201-053027-marostegui.json
[05:34:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[05:39:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:21:33] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:45:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T384592)', diff saved to https://phabricator.wikimedia.org/P72987 and previous config saved to /var/cache/conftool/dbconfig/20250201-064555-marostegui.json
[06:45:59] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[07:01:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P72988 and previous config saved to /var/cache/conftool/dbconfig/20250201-070103-marostegui.json
[07:16:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P72989 and previous config saved to
/var/cache/conftool/dbconfig/20250201-071609-marostegui.json
[07:16:28] FIRING: CertManagerCertNotReady: Certificate default/jayme-debug is not in a ready state (k8s-staging@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-staging&var-namespace=default - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[07:31:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T384592)', diff saved to https://phabricator.wikimedia.org/P72990 and previous config saved to /var/cache/conftool/dbconfig/20250201-073116-marostegui.json
[07:31:20] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[07:31:32] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on db1235.eqiad.wmnet with reason: Maintenance
[07:31:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T384592)', diff saved to https://phabricator.wikimedia.org/P72991 and previous config saved to /var/cache/conftool/dbconfig/20250201-073139-marostegui.json
[08:38:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T384592)', diff saved to https://phabricator.wikimedia.org/P72992 and previous config saved to /var/cache/conftool/dbconfig/20250201-083827-marostegui.json
[08:38:33] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[08:53:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P72993 and previous config saved to /var/cache/conftool/dbconfig/20250201-085335-marostegui.json
[09:08:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P72994 and previous config saved to
/var/cache/conftool/dbconfig/20250201-090842-marostegui.json
[09:23:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T384592)', diff saved to https://phabricator.wikimedia.org/P72995 and previous config saved to /var/cache/conftool/dbconfig/20250201-092349-marostegui.json
[09:23:52] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[09:24:05] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on db1239.eqiad.wmnet with reason: Maintenance
[10:12:01] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T385357 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[10:12:08] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T385357 (ops-monitoring-bot) NEW
[10:22:12] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[10:31:58] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T385361 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[10:32:04] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T385361 (ops-monitoring-bot) NEW
[10:42:58] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T385357#10514544 (Peachey88) →Duplicate dup:T382984
[10:42:59] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10514546 (Peachey88)
[10:43:15] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T385361#10514548 (Peachey88) →Duplicate dup:T382984
[10:43:17]
ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T382984#10514550 (Peachey88)
[11:16:28] FIRING: CertManagerCertNotReady: Certificate default/jayme-debug is not in a ready state (k8s-staging@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-staging&var-namespace=default - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[11:18:53] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[12:15:22] (CR) Phuedx: [C:-1] tests: Assert event stream configs have valid samples (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1115839 (owner: Phuedx)
[12:22:20] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on db2141.codfw.wmnet with reason: Maintenance
[12:44:32] FIRING: [2x] ProbeDown: Service restbase2034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:51:58] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T385363 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[12:52:03] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T385363 (ops-monitoring-bot) NEW
[12:57:07] FIRING: [4x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:59:32] FIRING: [4x] ProbeDown: Service ml-staging-ctrl2002:6443 has
failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:12:07] RESOLVED: [4x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:19:19] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on db2145.codfw.wmnet with reason: Maintenance
[13:19:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T384592)', diff saved to https://phabricator.wikimedia.org/P72996 and previous config saved to /var/cache/conftool/dbconfig/20250201-131925-marostegui.json
[13:19:28] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[13:29:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10514615 (phaultfinder)
[13:49:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:31:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T384592)', diff saved to https://phabricator.wikimedia.org/P72997 and previous config saved to /var/cache/conftool/dbconfig/20250201-143125-marostegui.json
[14:31:29] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[14:37:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:46:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P72998 and previous config saved to /var/cache/conftool/dbconfig/20250201-144632-marostegui.json
[15:01:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P72999 and previous config saved to /var/cache/conftool/dbconfig/20250201-150139-marostegui.json
[15:02:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:16:28] FIRING: CertManagerCertNotReady: Certificate default/jayme-debug is not in a ready state (k8s-staging@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-staging&var-namespace=default - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[15:16:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T384592)', diff saved to https://phabricator.wikimedia.org/P73000 and previous config saved to /var/cache/conftool/dbconfig/20250201-151646-marostegui.json
[15:16:50] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[15:17:02] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on db2146.codfw.wmnet with reason: Maintenance
[15:17:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T384592)', diff saved to https://phabricator.wikimedia.org/P73001 and previous config saved to
/var/cache/conftool/dbconfig/20250201-151709-marostegui.json
[16:02:06] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T385368 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[16:20:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T384592)', diff saved to https://phabricator.wikimedia.org/P73002 and previous config saved to /var/cache/conftool/dbconfig/20250201-162041-marostegui.json
[16:20:44] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[16:33:48] PROBLEM - Host cr2-magru is DOWN: PING CRITICAL - Packet loss = 100%
[16:34:40] RECOVERY - Host cr2-magru is UP: PING OK - Packet loss = 0%, RTA = 115.34 ms
[16:35:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P73003 and previous config saved to /var/cache/conftool/dbconfig/20250201-163548-marostegui.json
[16:38:10] A repeat of https://phabricator.wikimedia.org/T384774?
[16:41:02] <_joe_> sobanski: I'd guess so
[16:41:06] sobanski: yes. I've updated the task.
[16:41:16] I don't think it's worth depooling over tbh
[16:41:22] Thanks, I was just logging in to reopen :)
[16:41:45] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on cr2-magru with reason: IBGP instability from cr1 to cr2 in magru causing ping failures from alert1002
[16:42:46] I've downtimed the host and will pick it up Monday
[16:43:22] 👍
[16:50:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P73004 and previous config saved to /var/cache/conftool/dbconfig/20250201-165055-marostegui.json
[17:06:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T384592)', diff saved to https://phabricator.wikimedia.org/P73005 and previous config saved to /var/cache/conftool/dbconfig/20250201-170602-marostegui.json
[17:06:05] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[17:06:17] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on db2153.codfw.wmnet with reason: Maintenance
[17:06:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T384592)', diff saved to https://phabricator.wikimedia.org/P73006 and previous config saved to /var/cache/conftool/dbconfig/20250201-170624-marostegui.json
[17:49:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:19:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T384592)', diff saved to https://phabricator.wikimedia.org/P73007 and previous config saved to /var/cache/conftool/dbconfig/20250201-181943-marostegui.json
[18:19:46] T384592: Add normalization columns to categorylinks table -
https://phabricator.wikimedia.org/T384592
[18:34:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P73008 and previous config saved to /var/cache/conftool/dbconfig/20250201-183450-marostegui.json
[18:49:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P73009 and previous config saved to /var/cache/conftool/dbconfig/20250201-184957-marostegui.json
[19:05:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T384592)', diff saved to https://phabricator.wikimedia.org/P73010 and previous config saved to /var/cache/conftool/dbconfig/20250201-190504-marostegui.json
[19:05:07] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[19:05:19] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on db2170.codfw.wmnet with reason: Maintenance
[19:05:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T384592)', diff saved to https://phabricator.wikimedia.org/P73011 and previous config saved to /var/cache/conftool/dbconfig/20250201-190526-marostegui.json
[19:16:28] FIRING: CertManagerCertNotReady: Certificate default/jayme-debug is not in a ready state (k8s-staging@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-staging&var-namespace=default - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady
[19:49:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:10:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after
maintenance db2170 (T384592)', diff saved to https://phabricator.wikimedia.org/P73012 and previous config saved to /var/cache/conftool/dbconfig/20250201-201004-marostegui.json
[20:10:07] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[20:25:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P73013 and previous config saved to /var/cache/conftool/dbconfig/20250201-202511-marostegui.json
[20:40:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P73014 and previous config saved to /var/cache/conftool/dbconfig/20250201-204018-marostegui.json
[20:55:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T384592)', diff saved to https://phabricator.wikimedia.org/P73015 and previous config saved to /var/cache/conftool/dbconfig/20250201-205525-marostegui.json
[20:55:28] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[20:55:41] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on db2173.codfw.wmnet with reason: Maintenance
[20:55:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14:00:00 on db2186.codfw.wmnet with reason: Maintenance
[20:56:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T384592)', diff saved to https://phabricator.wikimedia.org/P73016 and previous config saved to /var/cache/conftool/dbconfig/20250201-205602-marostegui.json
[22:09:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T384592)', diff saved to https://phabricator.wikimedia.org/P73017 and previous config saved to /var/cache/conftool/dbconfig/20250201-220935-marostegui.json
[22:24:43] !log marostegui@cumin1002 dbctl commit (dc=all):
'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P73018 and previous config saved to /var/cache/conftool/dbconfig/20250201-222442-marostegui.json
[22:39:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P73019 and previous config saved to /var/cache/conftool/dbconfig/20250201-223949-marostegui.json
[22:54:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T384592)', diff saved to https://phabricator.wikimedia.org/P73020 and previous config saved to /var/cache/conftool/dbconfig/20250201-225456-marostegui.json
[22:55:00] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[22:55:12] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7:00:00 on db2174.codfw.wmnet with reason: Maintenance
[22:55:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T384592)', diff saved to https://phabricator.wikimedia.org/P73021 and previous config saved to /var/cache/conftool/dbconfig/20250201-225519-marostegui.json
[23:16:29] FIRING: CertManagerCertNotReady: Certificate default/jayme-debug is not in a ready state (k8s-staging@codfw) - https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager - https://grafana.wikimedia.org/d/vo5tiJTnz?var-site=codfw&var-cluster=k8s-staging&var-namespace=default - https://alerts.wikimedia.org/?q=alertname%3DCertManagerCertNotReady