[00:01:16] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1003/35724/otrs1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/802852 (https://phabricator.wikimedia.org/T293942) (owner: Dzahn)
[00:02:42] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1002/35725/otrs1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/802853 (https://phabricator.wikimedia.org/T293942) (owner: Dzahn)
[00:04:44] (CR) Dzahn: [V: -1 C: -1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/35726/otrs1001.eqiad.wmnet/change.otrs1001.eqiad.wmnet.err" [puppet] - https://gerrit.wikimedia.org/r/802851 (https://phabricator.wikimedia.org/T293942) (owner: Dzahn)
[00:07:25] (PS2) Dzahn: vrts: rename exim4 templates from otrs to vrts [puppet] - https://gerrit.wikimedia.org/r/802851 (https://phabricator.wikimedia.org/T293942)
[00:08:35] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1003/35727/" [puppet] - https://gerrit.wikimedia.org/r/802851 (https://phabricator.wikimedia.org/T293942) (owner: Dzahn)
[00:09:49] (CR) Dzahn: [C: -1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/35728/" [puppet] - https://gerrit.wikimedia.org/r/802854 (https://phabricator.wikimedia.org/T293942) (owner: Dzahn)
[00:13:08] (CR) Jforrester: Disable older WebM VP8 transcodes except 360p (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/802665 (https://phabricator.wikimedia.org/T309823) (owner: Brion VIBBER)
[00:22:03] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory -
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[00:29:37] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:59:01] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:59:03] RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[01:30:45] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:35:51] (PS52) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040)
[02:16:15] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:37:10] (PS10) Ryan Kemper: query service: port cronjobs to systemd timers [puppet] - https://gerrit.wikimedia.org/r/792104 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[02:37:23] SRE, Infrastructure-Foundations, Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (Halfak)
[02:39:20] (CR) Ryan Kemper: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35729/console" [puppet] - https://gerrit.wikimedia.org/r/792104 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[02:57:03] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:59:21] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:01:29] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:16:41] PROBLEM - Persistent high iowait on labstore1006 is CRITICAL: 17.94 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006-1007
[03:44:54] (PS1) Andrew Bogott: hwraid-2dev.cfg: try managing the second drive with late_command [puppet] - https://gerrit.wikimedia.org/r/802881
[03:48:53] RECOVERY - Persistent high iowait on labstore1006 is OK: (C)10 ge (W)5 ge 4.893 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006-1007
[03:52:40] (CR) Andrew Bogott: [C: +2] hwraid-2dev.cfg: try managing the second drive with late_command [puppet] - https://gerrit.wikimedia.org/r/802881 (owner: Andrew Bogott)
[03:53:55] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[03:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:54:05] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org w...
[04:11:23] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:12:57] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:24:45] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddumps1001.wikimedia.org with OS bullseye
[04:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:24:55] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with...
[04:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:28:39] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[04:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:28:49] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org w...
[04:42:45] (PS1) Andrew Bogott: clouddumps100x: further partitioning attempts [puppet] - https://gerrit.wikimedia.org/r/802882
[04:43:23] (Abandoned) Andrew Bogott: clouddumps100x: yet another partman recipe [puppet] - https://gerrit.wikimedia.org/r/802667 (owner: Andrew Bogott)
[04:43:47] (CR) CI reject: [V: -1] clouddumps100x: further partitioning attempts [puppet] - https://gerrit.wikimedia.org/r/802882 (owner: Andrew Bogott)
[04:44:55] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[04:45:13] (PS2) Andrew Bogott: clouddumps100x: further partitioning attempts [puppet] - https://gerrit.wikimedia.org/r/802882
[04:46:20] (CR) Andrew Bogott: [C: +2] clouddumps100x: further partitioning attempts [puppet] - https://gerrit.wikimedia.org/r/802882 (owner: Andrew Bogott)
[04:51:41] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[04:53:15] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host clouddumps1001.wikimedia.org with OS bullseye
[04:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:53:25] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with...
[04:53:30] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[04:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:53:39] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host clouddumps1001.wikimedia.org w...
[04:53:43] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[04:59:59] RECOVERY - Host cp1089.mgmt is UP: PING WARNING - Packet loss = 33%, RTA = 0.84 ms
[05:19:59] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:21:37] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddumps1001.wikimedia.org with OS bullseye
[05:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:47] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with...
[05:29:17] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:06:55] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:10:17] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refinery-sqoop-whole-mediawiki.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:29:37] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:30:19] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:37:07] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220604T0700)
[07:25:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[07:25:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[07:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T298560)', diff saved to https://phabricator.wikimedia.org/P29388 and previous config saved to /var/cache/conftool/dbconfig/20220604-072556-ladsgroup.json
[07:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:00] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[07:26:09] SRE, LDAP-Access-Requests: Grant Access to wmf for Lgaulia - https://phabricator.wikimedia.org/T309844 (Aklapper) @MoritzMuehlenhoff Per SRE's steps added in https://wikitech.wikimedia.org/w/index.php?title=SRE%2FLDAP&type=revision&diff=1929377&oldid=1924287 , @larissagaulia should be added to https://ph...
[08:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:39:43] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:33:23] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:39:18] SRE, ops-eqiad: db1128 faulty memory - https://phabricator.wikimedia.org/T309291 (Marostegui) Thank you Chris!
[10:05:27] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:11:49] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:54:35] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 44 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[11:56:53] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[12:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[12:31:55] PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 54766 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[12:37:09] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:13:59] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:29:05] (PS1) Andrew Bogott: clouddumps100x: at this point just trying to establish a partman baseline [puppet] - https://gerrit.wikimedia.org/r/802888
[13:30:48] (CR) Andrew Bogott: [C: +2] clouddumps100x: at this point just trying to establish a partman baseline [puppet] - https://gerrit.wikimedia.org/r/802888 (owner: Andrew Bogott)
[13:31:34] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[13:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:45] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by
andrew@cumin1001 for host clouddumps1001.wikimedia.org w...
[13:56:00] (PS1) Andrew Bogott: clouddumps100x: Check for a pxe regression [puppet] - https://gerrit.wikimedia.org/r/802889
[13:56:50] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host clouddumps1001.wikimedia.org with OS bullseye
[13:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:59] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host clouddumps1001.wikimedia.org with...
[13:57:18] (CR) Andrew Bogott: [C: +2] clouddumps100x: Check for a pxe regression [puppet] - https://gerrit.wikimedia.org/r/802889 (owner: Andrew Bogott)
[13:58:06] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[13:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:13] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddumps1001.wikimedia.org with OS bullseye
[13:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:51] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[14:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:33] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:48:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298560)', diff saved to https://phabricator.wikimedia.org/P29389 and previous config saved to /var/cache/conftool/dbconfig/20220604-154817-ladsgroup.json
[15:48:21] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:23] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[15:57:07] (PS1) Daimona Eaytoy: Remove references to $wgEnableLocalTimedText [mediawiki-config] - https://gerrit.wikimedia.org/r/802894
[16:03:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P29390 and previous config saved to /var/cache/conftool/dbconfig/20220604-160321-ladsgroup.json
[16:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:41] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:18:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P29391 and previous config saved to /var/cache/conftool/dbconfig/20220604-161827-ladsgroup.json
[16:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T298560)', diff saved to https://phabricator.wikimedia.org/P29392 and previous config saved to /var/cache/conftool/dbconfig/20220604-162110-ladsgroup.json
[16:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:15] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[16:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[16:32:14]
SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (Papaul) @Andrew sorry didn't have time yesterday to work on this was doing some planning for the codfw PDU's refresh. I took time...
[16:33:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T298560)', diff saved to https://phabricator.wikimedia.org/P29393 and previous config saved to /var/cache/conftool/dbconfig/20220604-163332-ladsgroup.json
[16:33:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[16:33:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[16:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:37] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[16:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T298560)', diff saved to https://phabricator.wikimedia.org/P29394 and previous config saved to /var/cache/conftool/dbconfig/20220604-163340-ladsgroup.json
[16:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P29395 and previous config saved to /var/cache/conftool/dbconfig/20220604-163615-ladsgroup.json
[16:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved
to https://phabricator.wikimedia.org/P29396 and previous config saved to /var/cache/conftool/dbconfig/20220604-165120-ladsgroup.json
[16:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:05] SRE, Cloud-Services, Datasets-General-or-Unknown, affects-Kiwix-and-openZIM, cloud-services-team (Kanban): Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (Kelson) > Looping in @Andrew. @Kelson note that yes, we are installing new, more capable machines that...
[17:06:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T298560)', diff saved to https://phabricator.wikimedia.org/P29397 and previous config saved to /var/cache/conftool/dbconfig/20220604-170625-ladsgroup.json
[17:06:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance
[17:06:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance
[17:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:31] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[17:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T298560)', diff saved to https://phabricator.wikimedia.org/P29398 and previous config saved to /var/cache/conftool/dbconfig/20220604-170633-ladsgroup.json
[17:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:17] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:43:39] PROBLEM - SSH on
druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:47:45] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:49:03] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:51:09] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:15:27] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:49:42] (PS1) Ladsgroup: os_reports: Make the reports look better [puppet] - https://gerrit.wikimedia.org/r/802897
[18:52:21] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:52:30] (CR) CI reject: [V: -1] os_reports: Make the reports look better [puppet] - https://gerrit.wikimedia.org/r/802897 (owner: Ladsgroup)
[18:58:50] (PS2) Ladsgroup: os_reports: Make the reports look better [puppet] - https://gerrit.wikimedia.org/r/802897
[19:01:37] (CR) CI reject: [V: -1] os_reports: Make the reports look better [puppet] - https://gerrit.wikimedia.org/r/802897 (owner: Ladsgroup)
[19:06:54] (PS3) Ladsgroup: os_reports: Make the reports look better [puppet] - https://gerrit.wikimedia.org/r/802897
[19:11:00] (CR) Ladsgroup: "This is an example of its output: https://people.wikimedia.org/~ladsgroup/os-reports/os-report-2022-06-04-stretch.html" [puppet] -
https://gerrit.wikimedia.org/r/802897 (owner: Ladsgroup)
[20:01:53] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[20:11:10] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[20:22:45] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[20:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[20:29:41] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[20:41:15] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[20:48:11] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[20:57:25] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[20:59:45] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 3 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[21:26:03] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:15:23] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[22:50:01] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[23:28:35] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:28:51] (PS1) Andrew Bogott: hwraid-2dev.cfg from papaul's tests [puppet] - https://gerrit.wikimedia.org/r/802900 (https://phabricator.wikimedia.org/T302981)
[23:29:14] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host clouddumps1001.wikimedia.org with OS bullseye
[23:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:45] (CR) Andrew Bogott: [C: +2] hwraid-2dev.cfg from papaul's tests [puppet] - https://gerrit.wikimedia.org/r/802900 (https://phabricator.wikimedia.org/T302981) (owner: Andrew Bogott)
[23:32:02] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[23:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:07] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (Andrew) >>! In T302981#7980873, @Papaul wrote: > @Andrew sorry didn't have time yesterday to work on this was doing some planning...
[23:49:00] (PS1) Andrew Bogott: hwraid-2dev.cfg: Try to get grub onto the boot partition [puppet] - https://gerrit.wikimedia.org/r/802904
[23:50:05] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host clouddumps1001.wikimedia.org with OS bullseye
[23:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:21] (CR) Andrew Bogott: [C: +2] hwraid-2dev.cfg: Try to get grub onto the boot partition [puppet] - https://gerrit.wikimedia.org/r/802904 (owner: Andrew Bogott)
[23:51:19] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host clouddumps1001.wikimedia.org with OS bullseye
[23:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log