[00:00:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:04:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:21] PROBLEM - Check systemd state on aphlict1001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:15:27] (CR) RLazarus: [C: +2] Revert "trafficserver: Temporarily disable mwdebug on kubernetes" [puppet] - https://gerrit.wikimedia.org/r/745873 (owner: Ladsgroup)
[00:16:41] PROBLEM - WDQS high update lag on wdqs1005 is CRITICAL: 6.833e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[00:42:39] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:50:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:54:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:11:33] RECOVERY - WDQS high update lag on wdqs1005 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 2.13e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[02:14:01] RECOVERY - SSH on kubernetes1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:15:43] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:33:09] SRE, serviceops, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (RLazarus)
[06:01:03] PROBLEM - SSH on db2086.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:01:59] RECOVERY - SSH on db2086.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:37:12] (CR) 4nn1l2: Remove redundant project namespace aliases (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/745220 (https://phabricator.wikimedia.org/T296643) (owner: 4nn1l2)
[07:42:37] (PS5) 4nn1l2: Remove redundant project namespace aliases [mediawiki-config] - https://gerrit.wikimedia.org/r/745220 (https://phabricator.wikimedia.org/T296643)
[08:21:57] (CR) 4nn1l2: Remove redundant project namespace aliases (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/745220 (https://phabricator.wikimedia.org/T296643) (owner: 4nn1l2)
[09:44:16] (PS12) Fomafix: Add additional aliases for sr-cyrl and sr-latn next to sr-ec and sr-el [puppet] - https://gerrit.wikimedia.org/r/368248 (https://phabricator.wikimedia.org/T117845)
[10:05:27] PROBLEM - SSH on db2086.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:08:29] (PS7) Fomafix: Add language codes sr-cyrl and sr-latn next to sr-ec and sr-el [mediawiki-config] - https://gerrit.wikimedia.org/r/375616 (https://phabricator.wikimedia.org/T117845)
[11:06:25] RECOVERY - SSH on db2086.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:38:04] (PS10) Jbond: puppetmaster: add puppet-facts-upload job to puppetmasters [puppet] - https://gerrit.wikimedia.org/r/745599
[12:38:42] (CR) jerkins-bot: [V: -1] puppetmaster: add puppet-facts-upload job to puppetmasters [puppet] - https://gerrit.wikimedia.org/r/745599 (owner: Jbond)
[12:39:18] (PS11) Jbond: puppetmaster: add puppet-facts-upload job to puppetmasters [puppet] - https://gerrit.wikimedia.org/r/745599
[12:39:53] (CR) jerkins-bot: [V: -1] puppetmaster: add puppet-facts-upload job to puppetmasters [puppet] - https://gerrit.wikimedia.org/r/745599 (owner: Jbond)
[12:48:17] (PS12) Jbond: puppetmaster: add puppet-facts-upload job to puppetmasters [puppet] - https://gerrit.wikimedia.org/r/745599
[12:48:19] (PS1) Jbond: puppet_compiler: client_max_body_size nginx [puppet] - https://gerrit.wikimedia.org/r/745980
[12:49:04] (CR) jerkins-bot: [V: -1] puppetmaster: add puppet-facts-upload job to puppetmasters [puppet] - https://gerrit.wikimedia.org/r/745599 (owner: Jbond)
[12:50:28] (CR) jerkins-bot: [V: -1] puppet_compiler: client_max_body_size nginx [puppet] - https://gerrit.wikimedia.org/r/745980 (owner: Jbond)
[12:52:11] (PS13) Jbond: puppetmaster: add puppet-facts-upload job to puppetmasters [puppet] - https://gerrit.wikimedia.org/r/745599
[12:53:07] (CR) jerkins-bot: [V: -1] puppetmaster: add puppet-facts-upload job to puppetmasters [puppet] - https://gerrit.wikimedia.org/r/745599 (owner: Jbond)
[12:56:50] (PS14) Jbond: puppetmaster: add puppet-facts-upload job to puppetmasters [puppet] - https://gerrit.wikimedia.org/r/745599
[12:57:47] (CR) jerkins-bot: [V: -1] puppetmaster: add puppet-facts-upload job to puppetmasters [puppet] - https://gerrit.wikimedia.org/r/745599 (owner: Jbond)
[13:05:53] (PS2) Jbond: puppet_compiler: client_max_body_size nginx [puppet] - https://gerrit.wikimedia.org/r/745980
[13:09:00] (CR) Jbond: [C: +2] puppet_compiler: client_max_body_size nginx [puppet] - https://gerrit.wikimedia.org/r/745980 (owner: Jbond)
[13:14:13] (PS1) Jbond: puppet_compiler::upload: Restart when settings change [puppet] - https://gerrit.wikimedia.org/r/745982
[14:12:17] (PS15) Jbond: puppetmaster: add puppet-facts-upload job to puppetmasters [puppet] - https://gerrit.wikimedia.org/r/745599
[14:12:52] (CR) jerkins-bot: [V: -1] puppetmaster: add puppet-facts-upload job to puppetmasters [puppet] - https://gerrit.wikimedia.org/r/745599 (owner: Jbond)
[15:40:45] SRE: Frequent backend server errors (503), happened several times in the last 2 days - https://phabricator.wikimedia.org/T297544 (Zabe)
[16:12:44] RECOVERY - Ensure hosts are not performing a change on every puppet run on cumin2002 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[16:24:27] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[16:26:33] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[16:28:22] (PS1) Jbond: puppet_compiler: add signiture verification [puppet] - https://gerrit.wikimedia.org/r/745986
[16:30:24] (CR) Jbond: [C: +2] puppet_compiler: add signiture verification [puppet] - https://gerrit.wikimedia.org/r/745986 (owner: Jbond)
[16:32:45] (CR) Jbond: [C: +2] puppet_compiler::upload: Restart when settings change [puppet] - https://gerrit.wikimedia.org/r/745982 (owner: Jbond)
[16:52:33] (PS16) Jbond: puppetmaster: add puppet-facts-upload job to puppetmasters [puppet] - https://gerrit.wikimedia.org/r/745599
[17:10:50] (PS17) Jbond: puppetmaster: add puppet-facts-upload job to puppetmasters [puppet] - https://gerrit.wikimedia.org/r/745599
[17:12:29] (PS1) Jbond: puppet_compiler: update realms type [puppet] - https://gerrit.wikimedia.org/r/745987
[17:13:10] (PS2) Jbond: puppet_compiler: update realms type [puppet] - https://gerrit.wikimedia.org/r/745987
[17:13:17] (CR) Jbond: [C: +2] puppet_compiler: update realms type [puppet] - https://gerrit.wikimedia.org/r/745987 (owner: Jbond)
[17:14:15] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:25:07] (PS1) Jbond: puppet_compiler::uploader: add web user permissions [puppet] - https://gerrit.wikimedia.org/r/745988
[17:25:50] (CR) Jbond: [C: +2] puppet_compiler::uploader: add web user permissions [puppet] - https://gerrit.wikimedia.org/r/745988 (owner: Jbond)
[17:49:28] (PS18) Jbond: puppetmaster: add puppet-facts-upload job to puppetmasters [puppet] - https://gerrit.wikimedia.org/r/745599
[17:50:13] (CR) jerkins-bot: [V: -1] puppetmaster: add puppet-facts-upload job to puppetmasters [puppet] - https://gerrit.wikimedia.org/r/745599 (owner: Jbond)
[17:51:28] (PS19) Jbond: puppetmaster: add puppet-facts-upload job to puppetmasters [puppet] - https://gerrit.wikimedia.org/r/745599
[17:52:12] (CR) jerkins-bot: [V: -1] puppetmaster: add puppet-facts-upload job to puppetmasters [puppet] - https://gerrit.wikimedia.org/r/745599 (owner: Jbond)
[17:55:02] (PS20) Jbond: puppetmaster: add puppet-facts-upload job to puppetmasters [puppet] - https://gerrit.wikimedia.org/r/745599
[18:00:08] (PS21) Jbond: puppetmaster: add puppet-facts-upload job to puppetmasters [puppet] - https://gerrit.wikimedia.org/r/745599
[18:01:23] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32958/console" [puppet] - https://gerrit.wikimedia.org/r/745599 (owner: Jbond)
[18:15:15] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:02:58] (PS1) Jbond: puppet_compiler: add pcc facts processor [puppet] - https://gerrit.wikimedia.org/r/745989
[19:03:19] SRE, serviceops, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (Ladsgroup) The memory dump of APCu is in my home directory in mw1414, I'm planning to take a look in Monday, t...
[19:03:59] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1028.eqiad.wmnet with OS buster
[19:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:04:44] SRE, serviceops, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (Ladsgroup) Another note: Most of OOMs in appservers seems to be coming from rest.php (is rest.php being served...
[19:37:04] (PS2) Jbond: puppet_compiler: add pcc facts processor [puppet] - https://gerrit.wikimedia.org/r/745989
[19:45:17] (PS1) Jbond: puppet_compiler: drop yaml dir from export facts tar ball [puppet] - https://gerrit.wikimedia.org/r/745990
[19:45:46] (CR) Jbond: puppet_compiler: add pcc facts processor (1 comment) [puppet] - https://gerrit.wikimedia.org/r/745989 (owner: Jbond)
[19:48:22] (PS3) Jbond: puppet_compiler: add pcc facts processor [puppet] - https://gerrit.wikimedia.org/r/745989
[19:48:45] (PS4) Jbond: puppet_compiler: add pcc facts processor [puppet] - https://gerrit.wikimedia.org/r/745989
[19:49:03] SRE, serviceops, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (ssastry) Note that for Parsoid itself, errors ilke this `Allowed memory size of 1468006400 bytes exhausted (tr...
[20:50:41] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:30:19] (PS5) Jbond: puppet_compiler: add pcc facts processor [puppet] - https://gerrit.wikimedia.org/r/745989
[21:44:24] (PS6) Jbond: puppet_compiler: add pcc facts processor [puppet] - https://gerrit.wikimedia.org/r/745989
[21:54:42] (PS7) Jbond: puppet_compiler: add pcc facts processor [puppet] - https://gerrit.wikimedia.org/r/745989
[21:55:16] (CR) jerkins-bot: [V: -1] puppet_compiler: add pcc facts processor [puppet] - https://gerrit.wikimedia.org/r/745989 (owner: Jbond)
[22:48:22] (PS8) Jbond: puppet_compiler: add pcc facts processor [puppet] - https://gerrit.wikimedia.org/r/745989
[22:53:49] (PS9) Jbond: puppet_compiler: add pcc facts processor [puppet] - https://gerrit.wikimedia.org/r/745989
[22:54:48] (CR) Jbond: "this is now working" [puppet] - https://gerrit.wikimedia.org/r/745989 (owner: Jbond)
[22:55:27] SRE, serviceops, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (daniel) >>! In T297517#7564757, @ssastry wrote: > Are rest.php requests on the app servers to the Parsoid REST...
[22:56:49] SRE, serviceops, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (daniel) By the way, @tstarling has recently been working on optimizing Parsoid. If it's Parsoid related, he ma...
[22:58:17] (PS10) Jbond: puppet_compiler: add pcc facts processor [puppet] - https://gerrit.wikimedia.org/r/745989
[23:13:39] (PS11) Jbond: puppet_compiler: add pcc facts processor [puppet] - https://gerrit.wikimedia.org/r/745989
[23:14:16] (CR) jerkins-bot: [V: -1] puppet_compiler: add pcc facts processor [puppet] - https://gerrit.wikimedia.org/r/745989 (owner: Jbond)
[23:20:54] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:21:01] SRE: Frequent backend server errors (503), happened several times in the last 2 days - https://phabricator.wikimedia.org/T297544 (Yann) Again now: `Request from 92.145.93.28 via cp3054 cp3054, Varnish XID 529632607 Error: 503, Backend fetch failed at Sat, 11 Dec 2021 23:18:47 GMT` for https://archive.org/do...
[23:42:13] (PS12) Jbond: puppet_compiler: add pcc facts processor [puppet] - https://gerrit.wikimedia.org/r/745989
[23:42:49] (CR) jerkins-bot: [V: -1] puppet_compiler: add pcc facts processor [puppet] - https://gerrit.wikimedia.org/r/745989 (owner: Jbond)
[23:53:58] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook