[00:01:29] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1043948 (owner: TrainBranchBot) [00:25:14] (PS1) BCornwall: acme-chief: Preparatory PyYAML formatting [puppet] - https://gerrit.wikimedia.org/r/1043979 [01:00:44] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 338.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:16:44] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:34:43] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [01:53:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T364069)', diff saved to https://phabricator.wikimedia.org/P64993 and previous config saved to /var/cache/conftool/dbconfig/20240615-015320-marostegui.json [01:53:26] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [01:54:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T352010)', diff saved to https://phabricator.wikimedia.org/P64994 and previous config saved to /var/cache/conftool/dbconfig/20240615-015458-ladsgroup.json [01:55:03] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [02:00:47] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:08:28] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P64995 and previous config saved to /var/cache/conftool/dbconfig/20240615-020827-marostegui.json [02:10:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P64996 and previous config saved to /var/cache/conftool/dbconfig/20240615-021005-ladsgroup.json [02:23:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P64997 and previous config saved to /var/cache/conftool/dbconfig/20240615-022335-marostegui.json [02:25:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P64998 and previous config saved to /var/cache/conftool/dbconfig/20240615-022512-ladsgroup.json [02:30:44] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 330.32 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:38:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T364069)', diff saved to https://phabricator.wikimedia.org/P64999 and previous config saved to /var/cache/conftool/dbconfig/20240615-023842-marostegui.json [02:38:44] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [02:38:47] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [02:38:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [02:39:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1181 (T364069)', diff saved to https://phabricator.wikimedia.org/P65000 and previous config saved to 
/var/cache/conftool/dbconfig/20240615-023904-marostegui.json [02:40:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T352010)', diff saved to https://phabricator.wikimedia.org/P65001 and previous config saved to /var/cache/conftool/dbconfig/20240615-024019-ladsgroup.json [02:40:22] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [02:40:24] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [02:40:35] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [03:19:38] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:35:46] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:55:47] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:02:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance [05:02:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance [05:02:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1014,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [05:02:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 
clouddb[1014,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [05:02:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T367261)', diff saved to https://phabricator.wikimedia.org/P65002 and previous config saved to /var/cache/conftool/dbconfig/20240615-050236-marostegui.json [05:02:41] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [05:33:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T367261)', diff saved to https://phabricator.wikimedia.org/P65003 and previous config saved to /var/cache/conftool/dbconfig/20240615-053346-marostegui.json [05:33:52] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [05:34:43] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [05:48:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P65004 and previous config saved to /var/cache/conftool/dbconfig/20240615-054854-marostegui.json [05:53:40] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (logstash1023), Fresh: 143 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:04:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P65005 and previous config saved to /var/cache/conftool/dbconfig/20240615-060401-marostegui.json [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - 
https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:19:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T367261)', diff saved to https://phabricator.wikimedia.org/P65006 and previous config saved to /var/cache/conftool/dbconfig/20240615-061908-marostegui.json [06:19:10] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance [06:19:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance [06:19:13] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [06:19:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T367261)', diff saved to https://phabricator.wikimedia.org/P65007 and previous config saved to /var/cache/conftool/dbconfig/20240615-061919-marostegui.json [06:26:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T364069)', diff saved to https://phabricator.wikimedia.org/P65008 and previous config saved to /var/cache/conftool/dbconfig/20240615-062631-marostegui.json [06:26:36] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [06:26:52] (PS1) Phedenskog: grafana: Change synthetic performance test proxy endpoint. 
[puppet] - https://gerrit.wikimedia.org/r/1044292 (https://phabricator.wikimedia.org/T367488) [06:33:12] (CR) Phedenskog: "Hi @colewhite I missed one change in https://phabricator.wikimedia.org/T367064 that Filippo helped me with: we have the JSON plugin in Gra" [puppet] - https://gerrit.wikimedia.org/r/1044292 (https://phabricator.wikimedia.org/T367488) (owner: Phedenskog) [06:33:38] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.09 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:41:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P65009 and previous config saved to /var/cache/conftool/dbconfig/20240615-064138-marostegui.json [06:53:38] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 144 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:56:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P65010 and previous config saved to /var/cache/conftool/dbconfig/20240615-065645-marostegui.json [06:59:38] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:08:46] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:11:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T364069)', diff saved to https://phabricator.wikimedia.org/P65011 and previous config saved to /var/cache/conftool/dbconfig/20240615-071152-marostegui.json [07:11:55] !log marostegui@cumin1002 START - Cookbook 
sre.hosts.downtime for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [07:11:57] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [07:12:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [07:12:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T364069)', diff saved to https://phabricator.wikimedia.org/P65012 and previous config saved to /var/cache/conftool/dbconfig/20240615-071215-marostegui.json [08:03:46] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:27:10] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance [09:27:23] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance [09:27:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2125 (T352010)', diff saved to https://phabricator.wikimedia.org/P65013 and previous config saved to /var/cache/conftool/dbconfig/20240615-092730-ladsgroup.json [09:27:35] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [09:34:43] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [10:19:38] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 321.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:27:38] RECOVERY - MariaDB Replica Lag: s1 on 
clouddb1017 is OK: OK slave_sql_lag Replication lag: 53.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:12:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T364069)', diff saved to https://phabricator.wikimedia.org/P65014 and previous config saved to /var/cache/conftool/dbconfig/20240615-111229-marostegui.json [11:12:34] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [11:27:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P65015 and previous config saved to /var/cache/conftool/dbconfig/20240615-112736-marostegui.json [11:42:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P65016 and previous config saved to /var/cache/conftool/dbconfig/20240615-114243-marostegui.json [11:57:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T364069)', diff saved to https://phabricator.wikimedia.org/P65017 and previous config saved to /var/cache/conftool/dbconfig/20240615-115750-marostegui.json [11:57:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [11:57:55] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [11:58:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [11:58:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T364069)', diff saved to https://phabricator.wikimedia.org/P65018 and previous config saved to /var/cache/conftool/dbconfig/20240615-115812-marostegui.json [13:34:43] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - 
https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [13:41:40] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 322.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:26:32] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:27:24] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.368 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:38:46] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:58:46] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:20:59] SRE, serviceops, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9895594 (VirginiaPoundstone) /api/rest_v1/metrics/commons-impact is my vote... 
[15:21:54] (PS1) EoghanGaffney: lists: Allow 'some files vanished' errors in rsync [puppet] - https://gerrit.wikimedia.org/r/1044731 [15:56:59] (CR) EoghanGaffney: [V:+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2930/co" [puppet] - https://gerrit.wikimedia.org/r/1044731 (owner: EoghanGaffney) [16:01:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T364069)', diff saved to https://phabricator.wikimedia.org/P65019 and previous config saved to /var/cache/conftool/dbconfig/20240615-160149-marostegui.json [16:01:55] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [16:03:06] (PS2) EoghanGaffney: lists: Allow 'some files vanished' errors in rsync [puppet] - https://gerrit.wikimedia.org/r/1044731 (https://phabricator.wikimedia.org/T367627) [16:16:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P65020 and previous config saved to /var/cache/conftool/dbconfig/20240615-161656-marostegui.json [16:32:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P65021 and previous config saved to /var/cache/conftool/dbconfig/20240615-163203-marostegui.json [16:47:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T364069)', diff saved to https://phabricator.wikimedia.org/P65022 and previous config saved to /var/cache/conftool/dbconfig/20240615-164710-marostegui.json [16:47:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [16:47:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [16:47:33] !log marostegui@cumin1002 
dbctl commit (dc=all): 'Depooling db1202 (T364069)', diff saved to https://phabricator.wikimedia.org/P65023 and previous config saved to /var/cache/conftool/dbconfig/20240615-164732-marostegui.json [17:34:43] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [18:42:44] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:11:44] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 319.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:32:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T364069)', diff saved to https://phabricator.wikimedia.org/P65024 and previous config saved to /var/cache/conftool/dbconfig/20240615-203229-marostegui.json [20:32:34] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [20:47:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P65025 and previous config saved to /var/cache/conftool/dbconfig/20240615-204735-marostegui.json [21:01:28] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS1299/IPv6: Idle - Telia, AS1299/IPv4: Idle - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:02:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P65026 and previous config saved to /var/cache/conftool/dbconfig/20240615-210243-marostegui.json [21:08:56] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 71 probes of 795 
(alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:13:56] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 7 probes of 795 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:17:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T364069)', diff saved to https://phabricator.wikimedia.org/P65027 and previous config saved to /var/cache/conftool/dbconfig/20240615-211750-marostegui.json [21:17:52] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1227.eqiad.wmnet with reason: Maintenance [21:17:55] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [21:18:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1227.eqiad.wmnet with reason: Maintenance [21:18:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T364069)', diff saved to https://phabricator.wikimedia.org/P65028 and previous config saved to /var/cache/conftool/dbconfig/20240615-211811-marostegui.json [21:28:40] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS1299/IPv6: Connect - Telia, AS1299/IPv4: Connect - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:30:58] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 80 probes of 795 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:34:43] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - 
https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:35:58] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 9 probes of 795 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:47:25] ops-codfw, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T367648 (phaultfinder) NEW [23:18:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T352010)', diff saved to https://phabricator.wikimedia.org/P65029 and previous config saved to /var/cache/conftool/dbconfig/20240615-231822-ladsgroup.json [23:18:27] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [23:33:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P65030 and previous config saved to /var/cache/conftool/dbconfig/20240615-233329-ladsgroup.json [23:37:50] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1045124 [23:37:50] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1045124 (owner: TrainBranchBot) [23:48:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P65031 and previous config saved to /var/cache/conftool/dbconfig/20240615-234836-ladsgroup.json