[00:01:26] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1052357 (owner: TrainBranchBot) [00:11:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T367856)', diff saved to https://phabricator.wikimedia.org/P65887 and previous config saved to /var/cache/conftool/dbconfig/20240706-001105-marostegui.json [00:11:09] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [00:26:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P65888 and previous config saved to /var/cache/conftool/dbconfig/20240706-002612-marostegui.json [00:37:05] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.058 second response time https://wikitech.wikimedia.org/wiki/Swift [00:38:05] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.096 second response time https://wikitech.wikimedia.org/wiki/Swift [00:41:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P65889 and previous config saved to /var/cache/conftool/dbconfig/20240706-004119-marostegui.json [00:43:07] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.054 second response time https://wikitech.wikimedia.org/wiki/Swift [00:44:05] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Swift [00:56:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T367856)', diff saved to https://phabricator.wikimedia.org/P65890 and previous config saved to
/var/cache/conftool/dbconfig/20240706-005626-marostegui.json [00:56:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [00:56:30] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [00:56:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [00:56:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T367856)', diff saved to https://phabricator.wikimedia.org/P65891 and previous config saved to /var/cache/conftool/dbconfig/20240706-005648-marostegui.json [01:46:41] FIRING: [2x] SystemdUnitFailed: rsync-deployment_module.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:48:04] FIRING: PuppetFailure: Puppet has failed on deploy1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:49:29] FIRING: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [02:26:07] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.055 second response time https://wikitech.wikimedia.org/wiki/Swift [02:27:07] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Swift [02:27:15] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 0.238 second response time https://wikitech.wikimedia.org/wiki/Swift [02:29:15] 
RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Swift [02:39:16] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:41:07] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/Swift [02:42:07] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Swift [02:59:16] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:06:17] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/Swift [03:08:17] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Swift [03:11:19] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 0.926 second response time https://wikitech.wikimedia.org/wiki/Swift [03:12:19] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Swift [03:37:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - 
https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [03:49:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T367856)', diff saved to https://phabricator.wikimedia.org/P65892 and previous config saved to /var/cache/conftool/dbconfig/20240706-034952-marostegui.json [03:49:56] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [04:05:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P65893 and previous config saved to /var/cache/conftool/dbconfig/20240706-040459-marostegui.json [04:20:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P65894 and previous config saved to /var/cache/conftool/dbconfig/20240706-042006-marostegui.json [04:35:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T367856)', diff saved to https://phabricator.wikimedia.org/P65895 and previous config saved to /var/cache/conftool/dbconfig/20240706-043513-marostegui.json [04:35:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2192.codfw.wmnet with reason: Maintenance [04:35:17] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [04:35:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2192.codfw.wmnet with reason: Maintenance [04:35:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T367856)', diff saved to https://phabricator.wikimedia.org/P65896 and previous config saved to /var/cache/conftool/dbconfig/20240706-043535-marostegui.json [04:46:25] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 0.672 second response time 
https://wikitech.wikimedia.org/wiki/Swift [04:47:23] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Swift [05:09:15] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.055 second response time https://wikitech.wikimedia.org/wiki/Swift [05:10:13] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.157 second response time https://wikitech.wikimedia.org/wiki/Swift [05:15:15] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.052 second response time https://wikitech.wikimedia.org/wiki/Swift [05:16:13] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Swift [05:23:15] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.054 second response time https://wikitech.wikimedia.org/wiki/Swift [05:24:15] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Swift [05:28:27] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 141 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:46:41] FIRING: [2x] SystemdUnitFailed: rsync-deployment_module.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:48:04] FIRING: PuppetFailure: Puppet has failed on deploy1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:49:44] FIRING: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [06:03:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:08:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:10:33] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 481.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:16:25] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.057 second response time https://wikitech.wikimedia.org/wiki/Swift [06:17:25] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 135 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [06:17:25] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Swift [06:20:04] SRE: Regression: Reading spam blacklists of all projects suddenly returns status 429 on fifth consecutive read -
https://phabricator.wikimedia.org/T369414 (Count_Count) NEW [06:20:33] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 3.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:23:26] SRE: Regression: Reading spam blacklists of all projects suddenly returns status 429 on fifth consecutive read - https://phabricator.wikimedia.org/T369414#9958071 (Count_Count) [06:25:01] SRE: Regression: Reading spam blacklists of all projects suddenly returns status 429 on fifth consecutive read - https://phabricator.wikimedia.org/T369414#9958072 (Count_Count) I haven't checked if this applies to _all_ `action=raw` requests. [06:25:30] SRE: Regression: Reading spam blacklists of all projects suddenly returns status 429 on fifth consecutive read - https://phabricator.wikimedia.org/T369414#9958073 (Count_Count) [06:27:25] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 51 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [06:28:25] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:42:06] SRE, Traffic: Regression: Reading spam blacklists of all projects suddenly returns status 429 on fifth consecutive read - https://phabricator.wikimedia.org/T369414#9958086 (Count_Count) [06:51:17] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.052 second response time https://wikitech.wikimedia.org/wiki/Swift [06:52:17] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Swift [06:54:13] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is
CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:56:26] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:07:33] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 332.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:09:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T367856)', diff saved to https://phabricator.wikimedia.org/P65897 and previous config saved to /var/cache/conftool/dbconfig/20240706-070927-marostegui.json [07:09:30] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [07:24:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P65898 and previous config saved to /var/cache/conftool/dbconfig/20240706-072434-marostegui.json [07:37:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [07:39:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P65899 and previous config saved to /var/cache/conftool/dbconfig/20240706-073941-marostegui.json [07:51:26] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:54:13] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:54:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T367856)', diff saved to https://phabricator.wikimedia.org/P65900 and previous config saved to /var/cache/conftool/dbconfig/20240706-075448-marostegui.json [07:54:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2201.codfw.wmnet with reason: Maintenance [07:54:52] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [07:55:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2201.codfw.wmnet with reason: Maintenance [07:55:41] SRE, Traffic: Regression: Reading spam blacklists of all projects suddenly returns status 429 on fifth consecutive read - https://phabricator.wikimedia.org/T369414#9958156 (Count_Count) Using the Mediawiki REST API works good enough. So this is not that urgent for me anymore.
[08:03:33] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 31.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:04:01] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 94373368 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [08:05:01] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2696 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [08:43:57] SRE, Traffic: Regression: Reading spam blacklists of all projects suddenly returns status 429 on fifth consecutive read - https://phabricator.wikimedia.org/T369414#9958164 (Bugreporter) when using curl, please specify a custom user agent. [08:58:38] SRE, Traffic: Regression: Reading spam blacklists of all projects suddenly returns status 429 on fifth consecutive read - https://phabricator.wikimedia.org/T369414#9958175 (Count_Count) >>! In T369414#9958164, @Bugreporter wrote: > when using curl, please specify a custom user agent. I do in my code, th...
[08:59:57] SRE, Traffic: Regression: Reading spam blacklists of all projects suddenly returns status 429 on fifth consecutive read - https://phabricator.wikimedia.org/T369414#9958176 (Count_Count) [09:48:04] FIRING: PuppetFailure: Puppet has failed on deploy1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:49:44] FIRING: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [11:24:23] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 0.653 second response time https://wikitech.wikimedia.org/wiki/Swift [11:25:21] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/Swift [11:37:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:51:41] FIRING: [2x] SystemdUnitFailed: rsync-deployment_module.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:21:33] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 511.80 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:32:15] (PS1) GergesShamon: [euwiki] Enable Visual Editor in namespace Project [mediawiki-config] - https://gerrit.wikimedia.org/r/1052372 (https://phabricator.wikimedia.org/T368632) [12:43:32] (CR)
ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1052372 (https://phabricator.wikimedia.org/T368632) (owner: GergesShamon) [12:45:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2211.codfw.wmnet with reason: Maintenance [12:45:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2211.codfw.wmnet with reason: Maintenance [12:45:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2211 (T367856)', diff saved to https://phabricator.wikimedia.org/P65901 and previous config saved to /var/cache/conftool/dbconfig/20240706-124535-marostegui.json [12:45:43] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [13:48:04] FIRING: PuppetFailure: Puppet has failed on deploy1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:49:44] FIRING: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [14:39:16] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:16] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:37:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync
fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:51:41] FIRING: [2x] SystemdUnitFailed: rsync-deployment_module.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:13:29] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/Swift [16:14:29] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Swift [16:23:42] SRE, collaboration-services, Stewards-Onboarding-Tool, Wikimedia-Mailing-lists, Patch-For-Review: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#9958341 (Urbanecm) @Dzahn: I popul... [17:18:34] !log hnowlan@cumin1002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad [17:21:25] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad [17:31:34] SRE-swift-storage, MediaWiki-Uploading, serviceops: Upload errors due to swift failures, 503s - https://phabricator.wikimedia.org/T369388#9958368 (hnowlan) It seems a bad frontend server was the source of these errors, and a rolling restart [[ https://grafana-rw.wikimedia.org/d/OPgmB1Eiz/swift?forceL...
[17:36:54] SRE, collaboration-services, Stewards-Onboarding-Tool, Wikimedia-Mailing-lists, Patch-For-Review: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#9958370 (Urbanecm) a:Urbanecm→... [17:41:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T367856)', diff saved to https://phabricator.wikimedia.org/P65902 and previous config saved to /var/cache/conftool/dbconfig/20240706-174103-marostegui.json [17:41:07] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [17:48:05] FIRING: PuppetFailure: Puppet has failed on deploy1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:49:44] FIRING: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [17:56:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P65903 and previous config saved to /var/cache/conftool/dbconfig/20240706-175610-marostegui.json [18:11:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P65904 and previous config saved to /var/cache/conftool/dbconfig/20240706-181117-marostegui.json [18:15:19] SRE-swift-storage, Data-Persistence, MediaWiki-Uploading: Upload errors due to swift failures, 503s - https://phabricator.wikimedia.org/T369388#9958387 (hnowlan) [18:26:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T367856)', diff saved to https://phabricator.wikimedia.org/P65905 and previous config saved to
/var/cache/conftool/dbconfig/20240706-182625-marostegui.json [18:26:28] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [18:53:45] (CR) Urbanecm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1052188 (https://phabricator.wikimedia.org/T351202) (owner: Urbanecm) [18:56:03] (PS1) Urbanecm: stewards: install python3-mwclient [puppet] - https://gerrit.wikimedia.org/r/1052382 (https://phabricator.wikimedia.org/T369429) [18:56:34] (PS2) Urbanecm: stewards: install python3-mwclient [puppet] - https://gerrit.wikimedia.org/r/1052382 (https://phabricator.wikimedia.org/T369429) [18:58:53] (PS1) Urbanecm: stewards: install python3-phabricator [puppet] - https://gerrit.wikimedia.org/r/1052383 (https://phabricator.wikimedia.org/T369322) [19:02:16] (PS3) Urbanecm: stewards: install python3-mwclient [puppet] - https://gerrit.wikimedia.org/r/1052382 (https://phabricator.wikimedia.org/T369429) [19:02:16] (PS2) Urbanecm: stewards: install python3-phabricator [puppet] - https://gerrit.wikimedia.org/r/1052383 (https://phabricator.wikimedia.org/T369322) [19:03:19] (PS2) Urbanecm: stewards: Add Phabricator API configuration [puppet] - https://gerrit.wikimedia.org/r/1052185 (https://phabricator.wikimedia.org/T369322) [19:03:25] (CR) Urbanecm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1052185 (https://phabricator.wikimedia.org/T369322) (owner: Urbanecm) [19:36:07] (PS1) Urbanecm: stewards: clone user DB repo from GitLab [puppet] - https://gerrit.wikimedia.org/r/1052384 (https://phabricator.wikimedia.org/T369430) [19:36:29] (CR) CI reject: [V:-1] stewards: clone user DB repo from GitLab [puppet] - https://gerrit.wikimedia.org/r/1052384 (https://phabricator.wikimedia.org/T369430) (owner: Urbanecm) [19:37:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status -
https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:37:33] (PS2) Urbanecm: stewards: clone user DB repo from GitLab [puppet] - https://gerrit.wikimedia.org/r/1052384 (https://phabricator.wikimedia.org/T369430) [19:37:55] (CR) CI reject: [V:-1] stewards: clone user DB repo from GitLab [puppet] - https://gerrit.wikimedia.org/r/1052384 (https://phabricator.wikimedia.org/T369430) (owner: Urbanecm) [19:38:37] (PS3) Urbanecm: stewards: clone user DB repo from GitLab [puppet] - https://gerrit.wikimedia.org/r/1052384 (https://phabricator.wikimedia.org/T369430) [19:42:30] (CR) Urbanecm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1052384 (https://phabricator.wikimedia.org/T369430) (owner: Urbanecm) [19:51:41] FIRING: [2x] SystemdUnitFailed: rsync-deployment_module.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:28:41] (Abandoned) Urbanecm: stewards: install python3-mwclient [puppet] - https://gerrit.wikimedia.org/r/1052382 (https://phabricator.wikimedia.org/T369429) (owner: Urbanecm) [20:29:32] (PS4) Urbanecm: stewards: install python3-phabricator [puppet] - https://gerrit.wikimedia.org/r/1052383 (https://phabricator.wikimedia.org/T369322) [20:35:43] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:35:53] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:36:33] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:36:43] RECOVERY - mailman
archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52338 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:49:32] FIRING: PuppetFailure: Puppet has failed on deploy1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [21:49:44] FIRING: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [23:37:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:38:16] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1052388 [23:38:17] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1052388 (owner: TrainBranchBot) [23:51:41] FIRING: [2x] SystemdUnitFailed: rsync-deployment_module.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed