[00:02:02] (03PS1) 10Dzahn: phabricator: add phab1004 to list of phab hosts for firewall [puppet] - 10https://gerrit.wikimedia.org/r/824802 (https://phabricator.wikimedia.org/T280597) [00:08:02] (03PS1) 10Dzahn: site: add phabricator role on phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/824803 (https://phabricator.wikimedia.org/T280597) [00:10:30] (03PS1) 10Dzahn: phabricator: remove phab1001 and phab2001 [puppet] - 10https://gerrit.wikimedia.org/r/824804 (https://phabricator.wikimedia.org/T280597) [00:11:55] (03PS1) 10Dzahn: dumps/phabricator: switch phab dumps host from phab1001 to phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/824805 (https://phabricator.wikimedia.org/T280597) [00:29:16] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:53:42] PROBLEM - snapshot of s5 in eqiad on backupmon1001 is CRITICAL: Last snapshot for s5 at eqiad (db1150) taken on 2022-08-20 00:11:00 is 510 GiB, but the previous one was 648 GiB, a change of -21.4 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:25:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance [01:25:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance [01:26:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T314041)', diff saved to https://phabricator.wikimedia.org/P32622 and previous config saved to /var/cache/conftool/dbconfig/20220820-012602-ladsgroup.json [01:26:06] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [01:36:45] (JobUnavailable) firing: (3) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:06:45] (JobUnavailable) resolved: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:14] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:15:56] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:25:26] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:27:46] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:30:08] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:30:08] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:43:40] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:45:56] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:54:00] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:55:00] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:01:50] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:09:00] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:16:14] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:22:54] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:32:20] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:33:06] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:39:30] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:44:18] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:46:34] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:46:40] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:55:16] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:08:04] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:15:14] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:35:00] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:56:17] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM! Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/822422 (https://phabricator.wikimedia.org/T257861) (owner: 10Cwhite) [04:59:46] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (phab2002), Fresh: 114 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:28:40] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:30:54] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:09:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T314041)', diff saved to https://phabricator.wikimedia.org/P32623 and previous config saved to /var/cache/conftool/dbconfig/20220820-060949-ladsgroup.json [06:09:54] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [06:11:34] (03CR) 10ArielGlenn: "Hannah ran PCC on this, results look ok to me, should check to see if we're missing anything before rolling this out Monday. https://puppe" [puppet] - 10https://gerrit.wikimedia.org/r/824690 (https://phabricator.wikimedia.org/T271736) (owner: 10ArielGlenn) [06:23:26] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:24:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P32624 and previous config saved to /var/cache/conftool/dbconfig/20220820-062455-ladsgroup.json [06:30:30] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:40:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P32625 and previous config saved to /var/cache/conftool/dbconfig/20220820-064001-ladsgroup.json [06:55:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T314041)', diff saved to https://phabricator.wikimedia.org/P32626 and previous config saved to /var/cache/conftool/dbconfig/20220820-065507-ladsgroup.json [06:55:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [06:55:13] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [06:55:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [06:55:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T314041)', diff saved to https://phabricator.wikimedia.org/P32627 and previous config saved to /var/cache/conftool/dbconfig/20220820-065528-ladsgroup.json [06:58:12] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220820T0700) [07:00:36] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:38] PROBLEM - SSH on wdqs1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:19:50] PROBLEM - SSH on wdqs1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:21:46] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [08:23:36] RECOVERY - SSH on wdqs1015 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:24:10] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 9 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [08:24:26] RECOVERY - SSH on wdqs1016 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:26:40] PROBLEM - SSH on wdqs1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:33:32] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [08:35:30] PROBLEM - SSH on wdqs1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:35:50] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [08:36:18] PROBLEM - SSH on wdqs1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:37:36] RECOVERY - SSH on wdqs1015 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:38:08] RECOVERY - SSH on wdqs1014 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:38:24] RECOVERY - SSH on wdqs1016 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:44:56] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [08:49:30] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 3 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [08:54:12] PROBLEM - SSH on wdqs1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:56:16] RECOVERY - SSH on wdqs1014 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:23:10] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:23:44] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 80, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:06:40] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 115 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [10:24:10] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [10:26:34] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 14 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [10:42:58] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:53:13] 10SRE, 10ExternalGuidance, 10Traffic-Icebox: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10jcrespo) > In order to focus on content and reduce the number of tools exposed in the UI chrome that may not work on an external context, we Could this be elabo... [11:12:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T314041)', diff saved to https://phabricator.wikimedia.org/P32628 and previous config saved to /var/cache/conftool/dbconfig/20220820-111238-ladsgroup.json [11:12:42] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [11:27:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P32629 and previous config saved to /var/cache/conftool/dbconfig/20220820-112744-ladsgroup.json [11:42:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P32630 and previous config saved to /var/cache/conftool/dbconfig/20220820-114249-ladsgroup.json [11:44:32] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:56] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:57:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T314041)', diff saved to https://phabricator.wikimedia.org/P32631 and previous config saved to /var/cache/conftool/dbconfig/20220820-115755-ladsgroup.json [11:57:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1105.eqiad.wmnet with reason: Maintenance [11:58:00] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [11:58:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1105.eqiad.wmnet with reason: Maintenance [11:58:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T314041)', diff saved to https://phabricator.wikimedia.org/P32632 and previous config saved to /var/cache/conftool/dbconfig/20220820-115816-ladsgroup.json [12:18:54] PROBLEM - Disk space on dumpsdata1003 is CRITICAL: DISK CRITICAL - free space: /data 875341 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops [12:23:22] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 79, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:24:56] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:29:44] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:32:08] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:47:02] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [13:49:28] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 22 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [14:44:00] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:46:22] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:58:16] 10SRE, 10Infrastructure-Foundations, 10Mail: Wiki email not delivered to GMail - https://phabricator.wikimedia.org/T243937 (10Aklapper) Anyone knows if this is still a problem? [15:03:48] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:56] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [15:15:38] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:12] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 7 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [15:29:23] (03CR) 10MarcoAurelio: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824444 (https://phabricator.wikimedia.org/T315715) (owner: 10MarcoAurelio) [16:52:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T314041)', diff saved to https://phabricator.wikimedia.org/P32633 and previous config saved to /var/cache/conftool/dbconfig/20220820-165203-ladsgroup.json [16:52:10] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [17:07:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P32634 and previous config saved to /var/cache/conftool/dbconfig/20220820-170711-ladsgroup.json [17:22:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P32635 and previous config saved to /var/cache/conftool/dbconfig/20220820-172217-ladsgroup.json [17:35:20] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [17:37:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T314041)', diff saved to https://phabricator.wikimedia.org/P32636 and previous config saved to /var/cache/conftool/dbconfig/20220820-173723-ladsgroup.json [17:37:28] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [17:37:42] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [17:40:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance [17:40:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance [17:40:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 9 hosts with reason: Maintenance [17:41:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 9 hosts with reason: Maintenance [17:47:36] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:01:26] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:05:04] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:05:24] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 80, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:20:32] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [18:22:54] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 6 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [18:34:48] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [18:39:32] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 17 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [18:53:04] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:53:20] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:55:44] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:02:44] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:40:30] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:45:16] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:21:00] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [20:23:20] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 9 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [20:55:40] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:05:16] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:05:34] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 79, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:18:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [22:18:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [22:18:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T314041)', diff saved to https://phabricator.wikimedia.org/P32637 and previous config saved to /var/cache/conftool/dbconfig/20220820-221826-ladsgroup.json [22:18:31] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [23:34:30] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [23:38:50] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 3 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator