[00:11:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T312863)', diff saved to https://phabricator.wikimedia.org/P31747 and previous config saved to /var/cache/conftool/dbconfig/20220723-001125-ladsgroup.json [00:11:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [00:11:30] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [00:11:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [00:11:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 6 hosts with reason: Maintenance [00:12:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 6 hosts with reason: Maintenance [00:13:10] SRE: Survey the third-party library market for UA policy compliance - https://phabricator.wikimedia.org/T313634 (RLazarus) p:Triage→Low [01:07:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T312863)', diff saved to https://phabricator.wikimedia.org/P31748 and previous config saved to /var/cache/conftool/dbconfig/20220723-010745-ladsgroup.json [01:07:51] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [01:18:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [01:18:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [01:22:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P31749 and previous config saved to
/var/cache/conftool/dbconfig/20220723-012250-ladsgroup.json [01:37:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:37:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P31750 and previous config saved to /var/cache/conftool/dbconfig/20220723-013755-ladsgroup.json [01:42:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:52:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:53:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T312863)', diff saved to https://phabricator.wikimedia.org/P31751 and previous config saved to /var/cache/conftool/dbconfig/20220723-015300-ladsgroup.json [01:53:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [01:53:05] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [01:53:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [02:17:45] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:22:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:35:09] SRE, Discovery, observability: Fix CirrusSearch monitoring - https://phabricator.wikimedia.org/T84163 (Aklapper) [04:01:34] SRE, SRE-Access-Requests: Requesting access to the Desktop Improvements project statistics for SGrabarczuk - https://phabricator.wikimedia.org/T313616 (Aklapper) [04:21:51] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:22:09] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:15:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [05:18:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [05:18:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [05:29:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1
day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance [05:29:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1101.eqiad.wmnet with reason: Maintenance [05:29:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T312863)', diff saved to https://phabricator.wikimedia.org/P31752 and previous config saved to /var/cache/conftool/dbconfig/20220723-052925-ladsgroup.json [05:29:30] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [05:30:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [05:35:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [05:35:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [05:35:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [05:35:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [05:36:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T312863)', diff saved to https://phabricator.wikimedia.org/P31753 and previous config saved to /var/cache/conftool/dbconfig/20220723-053604-ladsgroup.json [05:36:08] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [05:49:01] (CirrusSearchHighOldGCFrequency) firing: 
Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:30:27] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [06:32:59] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [06:45:35] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [06:50:41] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [06:52:15] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The 
following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220723T0700) [07:09:21] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:29:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [09:11:19] PROBLEM - WDQS SPARQL on wdqs1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [09:12:21] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:43:37] SRE, Data-Engineering, Discovery, wmde-team-b-tech: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (Aklapper) [09:52:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [09:52:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [09:52:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T312863)', diff saved to https://phabricator.wikimedia.org/P31754 and previous config saved to /var/cache/conftool/dbconfig/20220723-095241-ladsgroup.json [09:52:45] T312863: Schema change to change primary key of templatelinks -
https://phabricator.wikimedia.org/T312863 [10:07:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T312863)', diff saved to https://phabricator.wikimedia.org/P31755 and previous config saved to /var/cache/conftool/dbconfig/20220723-100713-ladsgroup.json [10:07:17] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [10:07:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T312863)', diff saved to https://phabricator.wikimedia.org/P31756 and previous config saved to /var/cache/conftool/dbconfig/20220723-100722-ladsgroup.json [10:12:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T312863)', diff saved to https://phabricator.wikimedia.org/P31757 and previous config saved to /var/cache/conftool/dbconfig/20220723-101250-ladsgroup.json [10:12:54] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [10:22:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P31758 and previous config saved to /var/cache/conftool/dbconfig/20220723-102218-ladsgroup.json [10:22:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P31759 and previous config saved to /var/cache/conftool/dbconfig/20220723-102227-ladsgroup.json [10:27:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P31760 and previous config saved to /var/cache/conftool/dbconfig/20220723-102755-ladsgroup.json [10:28:06] (PS1) Stang: Allow admin to grant/revoke "transwiki" group on zh(wikt|wb|wq|ws) [mediawiki-config] - https://gerrit.wikimedia.org/r/816239 (https://phabricator.wikimedia.org/T313657) [10:35:12] (PS1) Stang:
ruwikivoyage: Add "suppressredirect" right to "filemover" group [mediawiki-config] - https://gerrit.wikimedia.org/r/816242 (https://phabricator.wikimedia.org/T313614) [10:37:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P31761 and previous config saved to /var/cache/conftool/dbconfig/20220723-103723-ladsgroup.json [10:37:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P31762 and previous config saved to /var/cache/conftool/dbconfig/20220723-103733-ladsgroup.json [10:43:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P31763 and previous config saved to /var/cache/conftool/dbconfig/20220723-104300-ladsgroup.json [10:52:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T312863)', diff saved to https://phabricator.wikimedia.org/P31764 and previous config saved to /var/cache/conftool/dbconfig/20220723-105228-ladsgroup.json [10:52:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance [10:52:35] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [10:52:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T312863)', diff saved to https://phabricator.wikimedia.org/P31765 and previous config saved to /var/cache/conftool/dbconfig/20220723-105238-ladsgroup.json [10:52:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1105.eqiad.wmnet with reason: Maintenance [10:52:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1105.eqiad.wmnet with reason: Maintenance [10:52:54] !log ladsgroup@cumin1001 END
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance [10:52:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T312863)', diff saved to https://phabricator.wikimedia.org/P31766 and previous config saved to /var/cache/conftool/dbconfig/20220723-105257-ladsgroup.json [10:58:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T312863)', diff saved to https://phabricator.wikimedia.org/P31767 and previous config saved to /var/cache/conftool/dbconfig/20220723-105805-ladsgroup.json [10:58:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1100.eqiad.wmnet with reason: Maintenance [10:58:12] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [10:58:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1100.eqiad.wmnet with reason: Maintenance [10:58:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1100 (T312863)', diff saved to https://phabricator.wikimedia.org/P31768 and previous config saved to /var/cache/conftool/dbconfig/20220723-105825-ladsgroup.json [11:06:43] (PS4) Aklapper: Phabricator: add override for the browser time zone conflict message [puppet] - https://gerrit.wikimedia.org/r/718418 (https://phabricator.wikimedia.org/T158177) (owner: DannyS712) [11:44:18] SRE, Wikimedia-Mailing-lists: Create mailing list for WikiSound Audio Speaks Campaign - https://phabricator.wikimedia.org/T311230 (Aklapper) Open→Resolved Resolving per last comment so tasks don't linger.
[12:16:59] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:18:29] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:30:19] SRE: Survey the third-party library market for UA policy compliance - https://phabricator.wikimedia.org/T313634 (CDanis) [13:30:26] SRE, Traffic-Icebox, Patch-For-Review, User-notice: Rate limit requests in violation of User-Agent policy more aggressively - https://phabricator.wikimedia.org/T224891 (CDanis) [13:58:19] (PS2) Hashar: Json schema from Gerrit Java event classes [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/814725 (https://phabricator.wikimedia.org/T304947) [13:58:25] (PS7) Hashar: Send events to Wikimedia EventGate [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/814807 [13:58:31] (PS2) Hashar: POST events asynchronously [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/816115 [13:58:37] (PS4) Hashar: build: manage dependencies with rules_jvm_external [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/816172 [14:34:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T312863)', diff saved to https://phabricator.wikimedia.org/P31769 and previous config saved to /var/cache/conftool/dbconfig/20220723-143414-ladsgroup.json [14:34:21] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [14:49:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P31770 and previous config saved to /var/cache/conftool/dbconfig/20220723-144920-ladsgroup.json [14:51:35] PROBLEM - Check whether ferm is active by
checking the default input chain on ml-serve1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:04:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P31771 and previous config saved to /var/cache/conftool/dbconfig/20220723-150425-ladsgroup.json [15:07:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T312863)', diff saved to https://phabricator.wikimedia.org/P31772 and previous config saved to /var/cache/conftool/dbconfig/20220723-150754-ladsgroup.json [15:07:59] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [15:19:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T312863)', diff saved to https://phabricator.wikimedia.org/P31773 and previous config saved to /var/cache/conftool/dbconfig/20220723-151930-ladsgroup.json [15:19:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [15:19:35] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [15:19:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [15:19:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T312863)', diff saved to https://phabricator.wikimedia.org/P31774 and previous config saved to /var/cache/conftool/dbconfig/20220723-151951-ladsgroup.json [15:21:06] Hello, what is the reason for such an action from this account? 
[15:21:06] https://w.wiki/5VQr [15:23:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P31775 and previous config saved to /var/cache/conftool/dbconfig/20220723-152300-ladsgroup.json [15:23:09] RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:29:24] MdsShakil, a botnet attack [15:35:49] zabe, the hacker had access to the server! Concerning [15:38:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P31776 and previous config saved to /var/cache/conftool/dbconfig/20220723-153805-ladsgroup.json [15:38:36] MdsShakil: no, the hacker didn't have access to Wikimedia servers [15:39:14] It was part of a system activated to automatically restrict IPs that made certain actions when the wikis were facing a lot of vandalism [15:42:02] Those things should be stated clearly in the summary of the action [15:53:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T312863)', diff saved to https://phabricator.wikimedia.org/P31777 and previous config saved to /var/cache/conftool/dbconfig/20220723-155311-ladsgroup.json [15:53:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [15:53:17] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [15:53:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [15:55:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T312863)', diff saved to https://phabricator.wikimedia.org/P31778 and previous config saved to
/var/cache/conftool/dbconfig/20220723-155530-ladsgroup.json [16:10:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P31779 and previous config saved to /var/cache/conftool/dbconfig/20220723-161035-ladsgroup.json [16:19:53] SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (Horcrux92) Same error on it.wiki when tr... [16:23:11] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:25:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P31780 and previous config saved to /var/cache/conftool/dbconfig/20220723-162540-ladsgroup.json [16:40:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T312863)', diff saved to https://phabricator.wikimedia.org/P31781 and previous config saved to /var/cache/conftool/dbconfig/20220723-164045-ladsgroup.json [16:40:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1122.eqiad.wmnet with reason: Maintenance [16:40:50] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [16:41:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1122.eqiad.wmnet with reason: Maintenance [16:41:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T312863)', diff saved to https://phabricator.wikimedia.org/P31782 and previous config saved to /var/cache/conftool/dbconfig/20220723-164105-ladsgroup.json [17:24:37] RECOVERY - SSH on wtp1036.mgmt is
OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:01:45] SRE, Security-Team, Wikimedia-Mailing-lists, SecTeam-Processed, and 2 others: Mailman3 XSS via username on hyperkitty tags API - https://phabricator.wikimedia.org/T312506 (MoritzMuehlenhoff) This has been assigned CVE-2018-25045 [18:02:02] SRE, Security-Team, Wikimedia-Mailing-lists, SecTeam-Processed, and 2 others: Mailman3 XSS via username on hyperkitty tags API (CVE-2018-25045) - https://phabricator.wikimedia.org/T312506 (MoritzMuehlenhoff) [18:04:47] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:07:51] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:29:21] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:40:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [20:40:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [20:40:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T312863)', diff saved to https://phabricator.wikimedia.org/P31783 and previous config saved to /var/cache/conftool/dbconfig/20220723-204049-ladsgroup.json [20:40:54] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [20:41:44] SRE, serviceops, serviceops-collab, Patch-For-Review: Onboarding for Arnold Okoth - https://phabricator.wikimedia.org/T288645 (Aklapper) Open→Resolved @Arnoldokoth: No reply; closing [20:50:55] !log
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T312863)', diff saved to https://phabricator.wikimedia.org/P31784 and previous config saved to /var/cache/conftool/dbconfig/20220723-205054-ladsgroup.json [20:50:59] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [21:06:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P31785 and previous config saved to /var/cache/conftool/dbconfig/20220723-210559-ladsgroup.json [21:21:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P31786 and previous config saved to /var/cache/conftool/dbconfig/20220723-212105-ladsgroup.json [21:22:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T312863)', diff saved to https://phabricator.wikimedia.org/P31787 and previous config saved to /var/cache/conftool/dbconfig/20220723-212204-ladsgroup.json [21:22:11] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [21:25:24] SRE, Wikimedia-Planet: Find a replacement for RSS aggregator for planet.wikimedia.org - https://phabricator.wikimedia.org/T281219 (MoritzMuehlenhoff) Also: https://grep.be/blog//en/computer/Planet_Grep_now_running_PtLink/ [21:30:51] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:36:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T312863)', diff saved to https://phabricator.wikimedia.org/P31788 and previous config saved to /var/cache/conftool/dbconfig/20220723-213610-ladsgroup.json [21:36:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[21:36:16] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [21:36:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [21:37:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P31789 and previous config saved to /var/cache/conftool/dbconfig/20220723-213710-ladsgroup.json [21:52:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P31790 and previous config saved to /var/cache/conftool/dbconfig/20220723-215215-ladsgroup.json [22:07:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T312863)', diff saved to https://phabricator.wikimedia.org/P31791 and previous config saved to /var/cache/conftool/dbconfig/20220723-220720-ladsgroup.json [22:07:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [22:07:25] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [22:07:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [22:07:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T312863)', diff saved to https://phabricator.wikimedia.org/P31792 and previous config saved to /var/cache/conftool/dbconfig/20220723-220740-ladsgroup.json [22:44:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T312863)', diff saved to https://phabricator.wikimedia.org/P31793 and previous config saved to /var/cache/conftool/dbconfig/20220723-224412-ladsgroup.json [22:44:18] T312863: Schema change to change primary key of templatelinks - 
https://phabricator.wikimedia.org/T312863 [22:59:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P31794 and previous config saved to /var/cache/conftool/dbconfig/20220723-225917-ladsgroup.json [23:06:51] PROBLEM - Query Service HTTP Port on wdqs1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [23:14:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100', diff saved to https://phabricator.wikimedia.org/P31795 and previous config saved to /var/cache/conftool/dbconfig/20220723-231422-ladsgroup.json [23:25:43] PROBLEM - WDQS SPARQL on wdqs1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [23:28:09] RECOVERY - WDQS SPARQL on wdqs1011 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.063 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [23:29:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1100 (T312863)', diff saved to https://phabricator.wikimedia.org/P31796 and previous config saved to /var/cache/conftool/dbconfig/20220723-232927-ladsgroup.json [23:29:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance [23:29:32] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [23:29:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1096.eqiad.wmnet with reason: Maintenance [23:29:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T312863)', diff saved to https://phabricator.wikimedia.org/P31797 and previous config saved to /var/cache/conftool/dbconfig/20220723-232948-ladsgroup.json 
[23:51:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T312863)', diff saved to https://phabricator.wikimedia.org/P31798 and previous config saved to /var/cache/conftool/dbconfig/20220723-235136-ladsgroup.json [23:51:42] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863