[00:01:25] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1034944 (owner: TrainBranchBot) [00:01:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1007-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [00:06:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1007-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [00:17:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1007-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [00:28:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T364299)', diff saved to https://phabricator.wikimedia.org/P63142 and previous config saved to /var/cache/conftool/dbconfig/20240525-002835-marostegui.json [00:28:40] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [00:32:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1007-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [00:33:09] FIRING: CirrusSearchHighOldGCFrequency: 
Elasticsearch instance cloudelastic1007-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [00:36:47] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:38:09] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1007-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [00:43:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P63143 and previous config saved to /var/cache/conftool/dbconfig/20240525-004343-marostegui.json [00:58:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P63144 and previous config saved to /var/cache/conftool/dbconfig/20240525-005851-marostegui.json [01:14:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T364299)', diff saved to https://phabricator.wikimedia.org/P63145 and previous config saved to /var/cache/conftool/dbconfig/20240525-011359-marostegui.json [01:14:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1197.eqiad.wmnet with reason: Maintenance [01:14:04] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [01:14:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
db1197.eqiad.wmnet with reason: Maintenance [01:14:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T364299)', diff saved to https://phabricator.wikimedia.org/P63146 and previous config saved to /var/cache/conftool/dbconfig/20240525-011423-marostegui.json [01:38:46] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:38:50] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:49:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1007-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [01:59:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1007-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [02:05:16] FIRING: [4x] CertAlmostExpired: Certificate for service ml-staging-ctrl2001:6443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:11:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T364069)', diff saved to https://phabricator.wikimedia.org/P63147 and previous config saved to /var/cache/conftool/dbconfig/20240525-021154-marostegui.json [02:12:00] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [02:17:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T364299)', diff saved to https://phabricator.wikimedia.org/P63148 and previous config saved to 
/var/cache/conftool/dbconfig/20240525-021752-marostegui.json [02:17:58] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [02:21:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [02:27:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P63149 and previous config saved to /var/cache/conftool/dbconfig/20240525-022703-marostegui.json [02:33:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P63150 and previous config saved to /var/cache/conftool/dbconfig/20240525-023300-marostegui.json [02:36:47] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:42:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P63151 and previous config saved to /var/cache/conftool/dbconfig/20240525-024211-marostegui.json [02:48:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P63152 and previous config saved to /var/cache/conftool/dbconfig/20240525-024808-marostegui.json [02:56:47] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:57:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T364069)', diff saved to https://phabricator.wikimedia.org/P63153 and previous config saved to /var/cache/conftool/dbconfig/20240525-025719-marostegui.json [02:57:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1198.eqiad.wmnet with reason: Maintenance [02:57:24] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [02:57:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1198.eqiad.wmnet with reason: Maintenance [02:57:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T364069)', diff saved to https://phabricator.wikimedia.org/P63154 and previous config saved to /var/cache/conftool/dbconfig/20240525-025742-marostegui.json [03:03:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T364299)', diff saved to https://phabricator.wikimedia.org/P63155 and previous config saved to /var/cache/conftool/dbconfig/20240525-030316-marostegui.json [03:03:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance [03:03:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance [03:03:21] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [03:36:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [03:38:09] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance 
cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [03:48:09] RESOLVED: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [03:51:47] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:16:47] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:23:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1007-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [04:33:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1007-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [04:36:47] FIRING: SystemdUnitFailed: logrotate.service on 
moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:42:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1229.eqiad.wmnet with reason: Maintenance [04:42:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1229.eqiad.wmnet with reason: Maintenance [04:43:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1229 (T364299)', diff saved to https://phabricator.wikimedia.org/P63156 and previous config saved to /var/cache/conftool/dbconfig/20240525-044304-marostegui.json [04:43:09] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [04:51:09] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [05:24:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T364069)', diff saved to https://phabricator.wikimedia.org/P63157 and previous config saved to /var/cache/conftool/dbconfig/20240525-052423-marostegui.json [05:24:28] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [05:39:26] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:39:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P63158 and previous config saved to 
/var/cache/conftool/dbconfig/20240525-053931-marostegui.json [05:43:23] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:51:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T364299)', diff saved to https://phabricator.wikimedia.org/P63159 and previous config saved to /var/cache/conftool/dbconfig/20240525-055125-marostegui.json [05:51:30] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [05:54:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P63160 and previous config saved to /var/cache/conftool/dbconfig/20240525-055439-marostegui.json [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:05:16] FIRING: [4x] CertAlmostExpired: Certificate for service ml-staging-ctrl2001:6443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:06:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P63161 and previous config saved to /var/cache/conftool/dbconfig/20240525-060633-marostegui.json [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - 
https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T364069)', diff saved to https://phabricator.wikimedia.org/P63162 and previous config saved to /var/cache/conftool/dbconfig/20240525-060947-marostegui.json [06:09:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1212.eqiad.wmnet with reason: Maintenance [06:09:52] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [06:10:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1212.eqiad.wmnet with reason: Maintenance [06:10:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [06:10:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [06:10:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T364069)', diff saved to https://phabricator.wikimedia.org/P63163 and previous config saved to /var/cache/conftool/dbconfig/20240525-061028-marostegui.json [06:21:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P63164 and previous config saved to /var/cache/conftool/dbconfig/20240525-062141-marostegui.json [06:31:09] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:36:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T364299)', diff saved to https://phabricator.wikimedia.org/P63165 and previous config saved to /var/cache/conftool/dbconfig/20240525-063649-marostegui.json [06:36:52] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1233.eqiad.wmnet with reason: Maintenance [06:36:55] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [06:37:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1233.eqiad.wmnet with reason: Maintenance [06:37:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1233 (T364299)', diff saved to https://phabricator.wikimedia.org/P63166 and previous config saved to /var/cache/conftool/dbconfig/20240525-063712-marostegui.json [06:38:10] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:39:28] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:43:09] RESOLVED: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:43:23] RECOVERY - Check unit status of 
httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:00:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1007-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:10:39] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:15:39] FIRING: [3x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:35:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T364069)', diff saved to https://phabricator.wikimedia.org/P63167 and previous config saved to /var/cache/conftool/dbconfig/20240525-073533-marostegui.json [07:35:38] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [07:50:39] FIRING: [4x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:50:41] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P63168 and previous config saved to /var/cache/conftool/dbconfig/20240525-075041-marostegui.json [07:55:39] FIRING: [4x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:05:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P63169 and previous config saved to /var/cache/conftool/dbconfig/20240525-080549-marostegui.json [08:15:39] FIRING: [3x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:20:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T364069)', diff saved to https://phabricator.wikimedia.org/P63170 and previous config saved to /var/cache/conftool/dbconfig/20240525-082057-marostegui.json [08:21:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1240.eqiad.wmnet with reason: Maintenance [08:21:02] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [08:21:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1240.eqiad.wmnet with reason: Maintenance [08:24:28] (CR) Majavah: lists: Don't include automation in standby hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1035789 (owner: EoghanGaffney) [08:36:47] 
FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:52:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T364299)', diff saved to https://phabricator.wikimedia.org/P63171 and previous config saved to /var/cache/conftool/dbconfig/20240525-085250-marostegui.json [08:52:56] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [09:04:54] (PS1) Volans: P:cumin: fix support for aliasing LVS host classes [puppet] - https://gerrit.wikimedia.org/r/1035848 [09:07:52] (CR) Volans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1035848 (owner: Volans) [09:07:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P63172 and previous config saved to /var/cache/conftool/dbconfig/20240525-090758-marostegui.json [09:10:07] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:10:11] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:14:35] (CR) Volans: [C:+2] "Self-merging to fix broken puppet. Happy to adapt in a later CR if needed." [puppet] - https://gerrit.wikimedia.org/r/1035848 (owner: Volans) [09:17:56] (CR) Volans: [C:+1] "FYI I had to send I5a932aa6d80a887f0ae93c3ab74454ac1a4f1e1b to fix puppet on cloudcumin and cuminunpriv hosts. 
Probably if we had used Hos" [puppet] - https://gerrit.wikimedia.org/r/1035474 (owner: Ssingh) [09:23:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P63173 and previous config saved to /var/cache/conftool/dbconfig/20240525-092306-marostegui.json [09:38:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T364299)', diff saved to https://phabricator.wikimedia.org/P63174 and previous config saved to /var/cache/conftool/dbconfig/20240525-093814-marostegui.json [09:38:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1239.eqiad.wmnet with reason: Maintenance [09:38:20] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [09:38:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1239.eqiad.wmnet with reason: Maintenance [09:41:05] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:41:09] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:55:18] (PS3) EoghanGaffney: lists: Don't include automation in standby hosts [puppet] - https://gerrit.wikimedia.org/r/1035789 [09:58:21] (CR) EoghanGaffney: lists: Don't include automation in standby hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1035789 (owner: EoghanGaffney) [10:05:16] FIRING: [4x] CertAlmostExpired: Certificate for service ml-staging-ctrl2001:6443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:57:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [10:58:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with 
reason: Maintenance [11:09:10] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1246.eqiad.wmnet with reason: Maintenance [11:09:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1246.eqiad.wmnet with reason: Maintenance [11:09:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1246 (T364299)', diff saved to https://phabricator.wikimedia.org/P63175 and previous config saved to /var/cache/conftool/dbconfig/20240525-110931-marostegui.json [11:09:36] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [12:02:35] PROBLEM - Disk space on backup1011 is CRITICAL: DISK CRITICAL - free space: /srv/objectstorage 781089MiB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=backup1011&var-datasource=eqiad+prometheus/ops [12:15:54] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:20:39] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:35:39] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:36:47] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:12:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T364299)', diff saved to https://phabricator.wikimedia.org/P63176 and previous config saved to /var/cache/conftool/dbconfig/20240525-131236-marostegui.json [13:12:42] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [13:15:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance [13:16:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2127.codfw.wmnet with reason: Maintenance [13:16:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2127 (T364069)', diff saved to https://phabricator.wikimedia.org/P63177 and previous config saved to /var/cache/conftool/dbconfig/20240525-131619-marostegui.json [13:16:24] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [13:27:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P63178 and previous config saved to /var/cache/conftool/dbconfig/20240525-132744-marostegui.json [13:30:39] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [13:33:01] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:33:21] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:33:52] (PS1) Ebrahim: Change the Persian Wikibooks wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/1035852 (https://phabricator.wikimedia.org/T365913) [13:34:55] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:35:51] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:37:01] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51925 bytes in 9.006 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:37:13] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.275 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:42:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P63179 and previous config saved to /var/cache/conftool/dbconfig/20240525-134252-marostegui.json [13:46:00] (PS2) Ebrahim: Use updated workmark of Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1035852 (https://phabricator.wikimedia.org/T365913) [13:58:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T364299)', diff saved to https://phabricator.wikimedia.org/P63180 and previous config saved to /var/cache/conftool/dbconfig/20240525-135800-marostegui.json [13:58:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [13:58:05] !log marostegui@cumin1002 END (PASS) 
- Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [13:58:06] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [14:05:16] FIRING: [4x] CertAlmostExpired: Certificate for service ml-staging-ctrl2001:6443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:11:28] (PS3) Ebrahim: Use updated workmark of Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1035852 (https://phabricator.wikimedia.org/T365913) [14:12:42] (PS4) Ebrahim: Use updated workmark of Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1035852 (https://phabricator.wikimedia.org/T365913) [14:16:37] (PS5) Ebrahim: Use updated workmark of Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1035852 (https://phabricator.wikimedia.org/T365913) [14:36:47] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:50:01] FIRING: [4x] CertAlmostExpired: Certificate for service ml-staging-ctrl2001:6443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:53:56] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2002:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:56:47] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:03:59] FIRING: [4x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:05:01] RESOLVED: [4x] CertAlmostExpired: Certificate for service ml-staging-ctrl2001:6443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:10:11] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 38 probes of 796 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:15:09] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 34 probes of 796 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:35:25] FIRING: [2x] SystemdUnitFailed: kube-controller-manager.service on ml-staging-ctrl2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:36:25] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:37:03] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:38:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance 
[15:38:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance [15:39:53] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51924 bytes in 0.162 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:40:15] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.274 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:56:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T364069)', diff saved to https://phabricator.wikimedia.org/P63181 and previous config saved to /var/cache/conftool/dbconfig/20240525-155610-marostegui.json [15:56:16] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [16:09:30] FIRING: ProbeDown: Service wdqs1021:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:11:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P63182 and previous config saved to /var/cache/conftool/dbconfig/20240525-161118-marostegui.json [16:14:30] RESOLVED: ProbeDown: Service wdqs1021:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:25:18] (CR) Huji: [C:+1] Use updated workmark of Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1035852 (https://phabricator.wikimedia.org/T365913) (owner: Ebrahim) [16:26:27] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P63183 and previous config saved to /var/cache/conftool/dbconfig/20240525-162627-marostegui.json [16:36:47] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:41:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T364069)', diff saved to https://phabricator.wikimedia.org/P63184 and previous config saved to /var/cache/conftool/dbconfig/20240525-164135-marostegui.json [16:41:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [16:41:40] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [16:41:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [16:44:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2125.codfw.wmnet with reason: Maintenance [16:44:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2125.codfw.wmnet with reason: Maintenance [16:45:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2125 (T364299)', diff saved to https://phabricator.wikimedia.org/P63185 and previous config saved to /var/cache/conftool/dbconfig/20240525-164506-marostegui.json [16:45:12] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [16:51:00] (PS6) Ebrahim: Use updated wordmark of Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1035852 (https://phabricator.wikimedia.org/T365913) [17:00:39] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance 
cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [17:03:32] (PS7) Ebrahim: Use updated wordmark of Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1035852 (https://phabricator.wikimedia.org/T365913) [17:07:15] SRE, Traffic-Icebox, Wikimedia-Apache-configuration, Wikimedia-Site-requests, Patch-Needs-Improvement: redirect sco.wiktionary.org/wiki/(.*?) -> sco.wikipedia.org/wiki/Define:$1 - https://phabricator.wikimedia.org/T249648#9832668 (Pppery) [17:25:30] FIRING: ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:30:30] RESOLVED: ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:30:39] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [17:54:56] (PS8) Ebrahim: Use updated tagline and wordmark of Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1035852 (https://phabricator.wikimedia.org/T365913) [17:57:25] (PS9) Ebrahim: Use updated tagline and wordmark of 
Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1035852 (https://phabricator.wikimedia.org/T365913) [17:58:25] (PS10) Ebrahim: Use updated tagline and wordmark of Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1035852 (https://phabricator.wikimedia.org/T365913) [18:03:03] (PS11) Ebrahim: Use updated tagline and wordmark of Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1035852 (https://phabricator.wikimedia.org/T365913) [18:04:53] (PS12) Ebrahim: Use updated tagline and wordmark of Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1035852 (https://phabricator.wikimedia.org/T365913) [18:06:30] (PS13) Ebrahim: Use updated tagline and wordmark of Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1035852 (https://phabricator.wikimedia.org/T365913) [18:08:37] (PS14) Ebrahim: Update tagline and wordmark of Persian Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1035852 (https://phabricator.wikimedia.org/T365913) [18:17:25] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:18:05] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:19:21] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8617 bytes in 5.144 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:19:55] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:26:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T364299)', diff saved to https://phabricator.wikimedia.org/P63186 and previous config saved to /var/cache/conftool/dbconfig/20240525-182637-marostegui.json [18:26:43] T364299: 
Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [18:39:46] (PS15) Ebrahim: Update tagline and wordmark of Persian Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1035852 (https://phabricator.wikimedia.org/T365913) [18:41:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P63187 and previous config saved to /var/cache/conftool/dbconfig/20240525-184145-marostegui.json [18:56:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P63188 and previous config saved to /var/cache/conftool/dbconfig/20240525-185653-marostegui.json [19:00:39] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:06:47] FIRING: [4x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:09:00] (PS16) Ebrahim: Update tagline and wordmark of Persian Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1035852 (https://phabricator.wikimedia.org/T365913) [19:12:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T364299)', diff saved to https://phabricator.wikimedia.org/P63189 and previous config saved to /var/cache/conftool/dbconfig/20240525-191201-marostegui.json [19:12:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2126.codfw.wmnet with reason: Maintenance [19:12:07] T364299: Make rc_id a bigint - 
https://phabricator.wikimedia.org/T364299 [19:12:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2126.codfw.wmnet with reason: Maintenance [19:12:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2187.codfw.wmnet with reason: Maintenance [19:12:34] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2187.codfw.wmnet with reason: Maintenance [19:12:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2126 (T364299)', diff saved to https://phabricator.wikimedia.org/P63190 and previous config saved to /var/cache/conftool/dbconfig/20240525-191242-marostegui.json [19:30:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance [19:30:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance [19:30:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T364069)', diff saved to https://phabricator.wikimedia.org/P63191 and previous config saved to /var/cache/conftool/dbconfig/20240525-193047-marostegui.json [19:30:52] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [19:35:39] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:35:40] FIRING: [2x] SystemdUnitFailed: kube-controller-manager.service on ml-staging-ctrl2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:45:39] FIRING: 
[2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:50:39] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:51:18] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:51:22] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:22:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T364299)', diff saved to https://phabricator.wikimedia.org/P63192 and previous config saved to /var/cache/conftool/dbconfig/20240525-202207-marostegui.json [20:22:13] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [20:36:47] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:37:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P63193 and previous config saved to /var/cache/conftool/dbconfig/20240525-203715-marostegui.json [20:52:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P63194 and previous config saved to /var/cache/conftool/dbconfig/20240525-205223-marostegui.json 
[21:07:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T364299)', diff saved to https://phabricator.wikimedia.org/P63195 and previous config saved to /var/cache/conftool/dbconfig/20240525-210731-marostegui.json [21:07:34] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2138.codfw.wmnet with reason: Maintenance [21:07:36] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [21:07:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2138.codfw.wmnet with reason: Maintenance [21:07:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2138 (T364299)', diff saved to https://phabricator.wikimedia.org/P63196 and previous config saved to /var/cache/conftool/dbconfig/20240525-210754-marostegui.json [21:15:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:21:29] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:22:07] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:29:59] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:31:27] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8617 bytes in 2.075 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:31:51] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate 
lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:31:57] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:07:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T364069)', diff saved to https://phabricator.wikimedia.org/P63197 and previous config saved to /var/cache/conftool/dbconfig/20240525-220727-marostegui.json [22:07:32] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [22:19:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138 (T364299)', diff saved to https://phabricator.wikimedia.org/P63198 and previous config saved to /var/cache/conftool/dbconfig/20240525-221936-marostegui.json [22:19:41] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [22:22:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P63199 and previous config saved to /var/cache/conftool/dbconfig/20240525-222235-marostegui.json [22:34:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138', diff saved to https://phabricator.wikimedia.org/P63200 and previous config saved to /var/cache/conftool/dbconfig/20240525-223444-marostegui.json [22:37:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P63201 and previous config saved to /var/cache/conftool/dbconfig/20240525-223743-marostegui.json [22:49:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138', diff saved to https://phabricator.wikimedia.org/P63202 and previous config saved to /var/cache/conftool/dbconfig/20240525-224952-marostegui.json [22:52:52] !log marostegui@cumin1002 
dbctl commit (dc=all): 'Repooling after maintenance db2149 (T364069)', diff saved to https://phabricator.wikimedia.org/P63203 and previous config saved to /var/cache/conftool/dbconfig/20240525-225251-marostegui.json [22:52:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance [22:52:57] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [22:53:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance [22:53:10] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [22:53:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [22:53:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2156 (T364069)', diff saved to https://phabricator.wikimedia.org/P63204 and previous config saved to /var/cache/conftool/dbconfig/20240525-225331-marostegui.json [23:05:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138 (T364299)', diff saved to https://phabricator.wikimedia.org/P63205 and previous config saved to /var/cache/conftool/dbconfig/20240525-230500-marostegui.json [23:05:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2148.codfw.wmnet with reason: Maintenance [23:05:06] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [23:05:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2148.codfw.wmnet with reason: Maintenance [23:05:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2148 (T364299)', diff saved to https://phabricator.wikimedia.org/P63206 and previous config saved to /var/cache/conftool/dbconfig/20240525-230523-marostegui.json [23:06:47] FIRING: 
[4x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:35:40] FIRING: [2x] SystemdUnitFailed: kube-controller-manager.service on ml-staging-ctrl2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:38:18] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1035866 [23:38:18] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1035866 (owner: TrainBranchBot) [23:39:33] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:42:39] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:43:17] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:44:23] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:44:26] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:49:26] FIRING: [5x] SystemdUnitFailed: httpbb_hourly_appserver.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:53:05] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state