[00:20:29] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:22:11] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:37] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:40:05] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:59:28] SRE, SRE-Access-Requests: Shell access request for @demon - https://phabricator.wikimedia.org/T311314 (Legoktm) Yesssss. Welcome back!!
[02:01:19] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:02:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:03:30] here, looking
[02:06:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:07:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[02:10:27] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:18:20] looks like it was a spike of requests lasting about five minutes, but I'm not sure what exactly -- we got some 503s from swift (https://grafana.wikimedia.org/goto/PO9qB1q7z) but also high load from thumbor (https://grafana.wikimedia.org/goto/YWOkYJq7k) which looks like it was mostly djvu (https://grafana.wikimedia.org/goto/1HSiYJq7k) so I'm not sure which of those was a cause and which was an effect
[02:19:01] leaving it there for now though
[02:31:43] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:55:23] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:16:37] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:25:27] PROBLEM - SSH on restbase1018.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:19:17] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:19:55] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=CREATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:29:23] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:45:01] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[06:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:29:25] RECOVERY - SSH on restbase1018.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220626T0700)
[07:46:55] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[07:49:17] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 14 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[08:42:53] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 48.21 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[08:45:17] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 101 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6
[09:33:33] PROBLEM - SSH on restbase1018.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:34:29] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb=CREATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[10:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[10:34:55] RECOVERY - SSH on restbase1018.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:49:28] (CR) Stang: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/808124 (https://phabricator.wikimedia.org/T311104) (owner: Labdajiwa)
[11:51:05] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:27:22] Is there a tech expert around? We are getting an issue when we visit RC on metawiki.
[12:27:26] This search has timed out. You may wish to try different search parameters.
[12:35:04] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@b3fe77c]: (no justification provided)
[12:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:09] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@b3fe77c]: (no justification provided) (duration: 00m 04s)
[12:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:52:21] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:18:59] (PS1) Stang: enwiki: Raise wgPageTriageMaxAge to indefinite [mediawiki-config] - https://gerrit.wikimedia.org/r/808424 (https://phabricator.wikimedia.org/T310974)
[13:35:35] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:06:19] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[14:37:11] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[15:05:39] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[15:53:03] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[15:55:15] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 19 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[16:49:03] PROBLEM - Puppet CA expired certs on puppetmaster1001 is CRITICAL: CRITICAL: 2 puppet certs need to be renewed: https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate
[18:01:11] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[19:02:27] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:59:15] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[19:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:02:24] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:55] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[20:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:07:02] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[20:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:09:54] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:01] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:09] PROBLEM - SSH on restbase1018.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:58:46] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudcephosd1027.mgmt.eqiad.wmnet with reboot policy FORCED
[20:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:59:33] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudcephosd1028.mgmt.eqiad.wmnet with reboot policy FORCED
[20:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:20] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudcephosd1029.mgmt.eqiad.wmnet with reboot policy FORCED
[21:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:01:20] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudcephosd1030.mgmt.eqiad.wmnet with reboot policy FORCED
[21:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:12:19] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:13:45] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[21:18:04] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1030.mgmt.eqiad.wmnet with reboot policy FORCED
[21:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:18:09] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1028.mgmt.eqiad.wmnet with reboot policy FORCED
[21:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:18:13] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1027.mgmt.eqiad.wmnet with reboot policy FORCED
[21:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:18:18] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1029.mgmt.eqiad.wmnet with reboot policy FORCED
[21:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:22] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudcephosd1032.mgmt.eqiad.wmnet with reboot policy FORCED
[21:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:26] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudcephosd1031.mgmt.eqiad.wmnet with reboot policy FORCED
[21:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:42] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudcephosd1033.mgmt.eqiad.wmnet with reboot policy FORCED
[21:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:23:02] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cloudcephosd1034.mgmt.eqiad.wmnet with reboot policy FORCED
[21:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:41:56] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1034.mgmt.eqiad.wmnet with reboot policy FORCED
[21:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:23] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1033.mgmt.eqiad.wmnet with reboot policy FORCED
[21:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:27] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1032.mgmt.eqiad.wmnet with reboot policy FORCED
[21:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:30] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1031.mgmt.eqiad.wmnet with reboot policy FORCED
[21:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:43:49] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (Cmjohnson) Both ports have been updated in netbox, BIOS has been set up
[21:43:55] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (Cmjohnson)
[21:49:16] (PS1) Cmjohnson: Add new cloudcephosd servers to site.pp [puppet] - https://gerrit.wikimedia.org/r/808541 (https://phabricator.wikimedia.org/T294972)
[21:50:43] (CR) Cmjohnson: [C: +2] Add new cloudcephosd servers to site.pp [puppet] - https://gerrit.wikimedia.org/r/808541 (https://phabricator.wikimedia.org/T294972) (owner: Cmjohnson)
[22:05:16] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (Cmjohnson) @Jclark-ctr any word on getting this server fixed?
[22:08:31] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[22:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:10:02] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[22:11:07] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:12:43] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:12:59] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q3:(Need By: TBD) rack/setup/install stat1009 - https://phabricator.wikimedia.org/T299466 (Cmjohnson)
[22:13:33] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:34:25] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (Cmjohnson) @btullis @robh was working on this last week. /dev/sda and /dev/sdb are swapped by the controller regardless of how they were inputted. It appears a partman recipe cha...
[22:35:00] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host stat1009.mgmt.eqiad.wmnet with reboot policy FORCED
[22:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:37:25] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host stat1009.mgmt.eqiad.wmnet with reboot policy FORCED
[22:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:38:39] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1025.eqiad.wmnet with OS buster
[22:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:38:47] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudcephosd1025.eqiad.wmnet with OS...
[22:59:21] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (Cmjohnson) @Andrew I am not sure which raid configuration you need. I don't know what cloudcephosd1020 has going other I see a /dev/sda a...
[23:00:50] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (Cmjohnson)
[23:10:23] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[23:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:27] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:14:10] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:19:47] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host conf1007.mgmt.eqiad.wmnet with reboot policy FORCED
[23:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:19:58] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1025.eqiad.wmnet with OS buster
[23:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:20:04] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudcephosd10[25-34] - https://phabricator.wikimedia.org/T294972 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudcephosd1025.eqiad.wmnet with OS bus...
[23:20:26] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host conf1008.mgmt.eqiad.wmnet with reboot policy FORCED
[23:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:21:02] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host conf1009.mgmt.eqiad.wmnet with reboot policy FORCED
[23:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:33:49] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host conf1009.mgmt.eqiad.wmnet with reboot policy FORCED
[23:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:34:00] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host conf1008.mgmt.eqiad.wmnet with reboot policy FORCED
[23:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:34:05] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host conf1007.mgmt.eqiad.wmnet with reboot policy FORCED
[23:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:36:04] SRE, ops-eqiad, DC-Ops, serviceops: Q3:(Need By: TBD) rack/setup/install conf100[789] - https://phabricator.wikimedia.org/T301272 (Cmjohnson)
[23:43:36] (PS1) Cmjohnson: Adding stat1009 and stat1010 to site.pp [puppet] - https://gerrit.wikimedia.org/r/808545 (https://phabricator.wikimedia.org/T307399)
[23:43:40] (PS1) Cmjohnson: Adding conf1007-9 to site.pp and and netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/808546 (https://phabricator.wikimedia.org/T301272)
[23:46:28] (Abandoned) Cmjohnson: Adding stat1009 and stat1010 to site.pp [puppet] - https://gerrit.wikimedia.org/r/808545 (https://phabricator.wikimedia.org/T307399) (owner: Cmjohnson)
[23:46:38] (CR) Cmjohnson: [C: +2] Adding conf1007-9 to site.pp and and netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/808546 (https://phabricator.wikimedia.org/T301272) (owner: Cmjohnson)
[23:46:47] (PS2) Cmjohnson: Adding conf1007-9 to site.pp and and netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/808546 (https://phabricator.wikimedia.org/T301272)
[23:47:01] (CR) Cmjohnson: [V: +2] Adding conf1007-9 to site.pp and and netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/808546 (https://phabricator.wikimedia.org/T301272) (owner: Cmjohnson)
[23:51:38] (PS1) Cmjohnson: updating site.pp for stat1009 and stat1010 [puppet] - https://gerrit.wikimedia.org/r/808547 (https://phabricator.wikimedia.org/T299466)
[23:52:32] (CR) Cmjohnson: [C: +2] updating site.pp for stat1009 and stat1010 [puppet] - https://gerrit.wikimedia.org/r/808547 (https://phabricator.wikimedia.org/T299466) (owner: Cmjohnson)
[23:57:46] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host conf1007.eqiad.wmnet with OS bullseye
[23:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:53] SRE, ops-eqiad, DC-Ops, serviceops, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install conf100[789] - https://phabricator.wikimedia.org/T301272 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host conf1007.eqiad.wmnet with OS bullseye