[00:38:19] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1118223
[00:38:19] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1118223 (owner: TrainBranchBot)
[00:44:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2085-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[00:48:41] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1118223 (owner: TrainBranchBot)
[00:51:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T384592)', diff saved to https://phabricator.wikimedia.org/P73427 and previous config saved to /var/cache/conftool/dbconfig/20250209-005121-marostegui.json
[00:51:25] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[00:54:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2085-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[01:06:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P73428 and previous config saved to /var/cache/conftool/dbconfig/20250209-010628-marostegui.json
[01:08:27] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1118224
[01:08:27] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1118224 (owner: TrainBranchBot)
[01:14:33] FIRING: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[01:17:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2085-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[01:19:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10533915 (phaultfinder)
[01:21:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2240', diff saved to https://phabricator.wikimedia.org/P73429 and previous config saved to /var/cache/conftool/dbconfig/20250209-012135-marostegui.json
[01:29:25] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1118224 (owner: TrainBranchBot)
[01:34:32] RESOLVED: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[01:36:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2240 (T384592)', diff saved to https://phabricator.wikimedia.org/P73430 and previous config saved to /var/cache/conftool/dbconfig/20250209-013642-marostegui.json
[01:36:46] T384592: Add normalization columns to categorylinks table - https://phabricator.wikimedia.org/T384592
[01:42:41] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:37:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:42:48] FIRING: PuppetFailure: Puppet has failed on build2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:02:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:17:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2085-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[05:19:33] FIRING: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[05:19:59] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - AS7195/IPv4: Connect - EdgeUno https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:39:33] RESOLVED: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[05:42:41] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:36:01] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - AS7195/IPv4: Active - EdgeUno https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:42:49] FIRING: PuppetFailure: Puppet has failed on build2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:00:05] Deploy window: No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250209T0800)
[08:46:16] Puppet: Module uwsgi doesn't allow passing multiple config params of same name - https://phabricator.wikimedia.org/T123809#10533982 (taavi) Open→Resolved: The module does support passing an array, which will be correctly rendered as multiple statements in the generated file.
[09:17:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2085-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[09:24:33] FIRING: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[09:42:41] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:44:33] RESOLVED: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[10:42:49] FIRING: PuppetFailure: Puppet has failed on build2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:12:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2085-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[12:13:19] PROBLEM - Disk space on stat1008 is CRITICAL: DISK CRITICAL - free space: / 2776 MB (3% inode=84%): /tmp 2776 MB (3% inode=84%): /var/tmp 2776 MB (3% inode=84%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops
[12:45:37] (CR) Ladsgroup: [C:+1] Reformat with Black [cookbooks] - https://gerrit.wikimedia.org/r/1118098 (owner: Federico Ceratto)
[13:21:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2085-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[13:29:33] FIRING: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[13:29:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10534095 (phaultfinder)
[13:32:05] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - No response from remote host 195.200.68.129 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:33:10] PROBLEM - Host cr2-magru is DOWN: PING CRITICAL - Packet loss = 100%
[13:33:34] RECOVERY - Host cr2-magru is UP: PING OK - Packet loss = 0%, RTA = 115.74 ms
[13:34:27] I'm somewhat here
[13:34:37] shall I depool magru?
[13:34:59] <_joe_> It was a flap
[13:35:24] <_joe_> !incidents
[13:35:25] 5668 (UNACKED) Host cr2-magru - PING - Packet loss = 100%
[13:35:44] <_joe_> !resolve 5668
[13:35:44] 5668 (RESOLVED) Host cr2-magru - PING - Packet loss = 100%
[13:36:30] yeah, I got it, I'm just worried it might flap again
[13:36:37] if it does, we'll depool it
[13:42:41] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:49:33] RESOLVED: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[13:52:25] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on cr2-magru with reason: IBGP instability from cr1 to cr2 in magru causing ping failures from alert1002
[13:56:43] I downtimed the CR in magru again. We've had a few of these blips, but I don't think they're causing sufficient instability to depool
[13:58:22] ops-magru, Infrastructure-Foundations, netops: Jan 2025 - Magru core router connectivity blips - https://phabricator.wikimedia.org/T384774#10534126 (cmooney) We got paged again for ping loss here. I have downtimed cr2-magru for two days and will work with Juniper tomorrow on the issue. We may need...
[13:58:27] there is a case open with Juniper on it; I'll be following up with them tomorrow
[14:16:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2085-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[14:24:44] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10534156 (phaultfinder)
[14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:42:49] FIRING: PuppetFailure: Puppet has failed on build2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:34:33] FIRING: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[17:42:41] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:54:33] RESOLVED: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[18:42:49] FIRING: PuppetFailure: Puppet has failed on build2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[21:11:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic2085-production-search-psi-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[21:24:44] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10534294 (phaultfinder)
[21:39:33] FIRING: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[21:42:41] FIRING: [3x] SystemdUnitFailed: etcd-backup.service on aux-k8s-etcd2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:59:33] RESOLVED: Wikidata Reliability Metrics - Median loading time alert: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[22:42:49] FIRING: PuppetFailure: Puppet has failed on build2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:40:33] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:41:21] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:42:11] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53514 bytes in 0.153 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:42:23] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.177 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring