[00:00:05] brennen: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220211T0000). [00:00:05] nn1l2 and Jdlrobson: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:09] o/ [00:00:10] hi [00:00:16] I'll be sitting in for brennen|afk [00:00:28] hi nn1l2 [00:00:34] o/ present [00:00:34] hi [00:02:35] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [00:03:09] getting setup [00:03:15] PROBLEM - ElasticSearch setting check - 9400 on elastic1057 is CRITICAL: CRITICAL - .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [00:05:22] (03PS1) 10Clare Ming: Update Beta cluster configuration for new and existing accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761772 (https://phabricator.wikimedia.org/T301166) [00:06:51] (03CR) 10Clare Ming: "err'd on side of matching all vector config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761772 (https://phabricator.wikimedia.org/T301166) (owner: 10Clare Ming) [00:06:58] bwang: o/ [00:07:36] (03CR) 10Thcipriani: [C: 03+2] urwiki: Add patroller usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761700 (https://phabricator.wikimedia.org/T301491) (owner: 104nn1l2) [00:08:24] (03Merged) 10jenkins-bot: urwiki: Add patroller usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761700 (https://phabricator.wikimedia.org/T301491) (owner: 104nn1l2) [00:13:53] @nn1l2 your patch is on mwdebug1001, can you check pls? 
[00:13:59] ok [00:14:02] ok [00:14:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T298554)', diff saved to https://phabricator.wikimedia.org/P20594 and previous config saved to /var/cache/conftool/dbconfig/20220211-001425-ladsgroup.json [00:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [00:14:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [00:14:30] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [00:14:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [00:14:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Maintenance [00:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance [00:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:01] bwang: LGTM, please sync [00:15:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [00:15:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [00:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:11] !log bwang@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:761700|urwiki: Add patroller usergroup (T301491)]] (duration: 00m 49s) [00:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:16] T301491: Requesting for "patroller" user group to be created on Urdu Wikipedia. - https://phabricator.wikimedia.org/T301491 [00:16:27] ^ nn1l2 should be live everywhere [00:16:35] Thanks! [00:17:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [00:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:01] PROBLEM - ElasticSearch setting check - 9400 on elastic1076 is CRITICAL: CRITICAL - .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [00:19:41] (03CR) 10Thcipriani: [C: 03+2] Make Vector 2022 the default skin for MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761739 (https://phabricator.wikimedia.org/T298519) (owner: 10Jdlrobson) [00:20:19] (03Merged) 10jenkins-bot: Make Vector 2022 the default skin for MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761739 (https://phabricator.wikimedia.org/T298519) (owner: 10Jdlrobson) [00:22:57] Jdlrobson: you change is live on mwdebug1002, check please! [00:23:02] *your [00:23:49] thcipriani: testing thanks [00:24:41] LGTM thcipriani [00:27:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [00:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:27:57] PROBLEM - ElasticSearch setting check - 9400 on elastic1068 is CRITICAL: CRITICAL - .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [00:28:35] Jdlrobson: cool, should wmf-config/config/mediawikiwiki.yaml go before dblists/desktop-improvements.dblist or does the order matter? 
[00:29:27] I believe the dblist is derived from the yaml [00:29:38] ah, ok [00:29:44] I think the yaml is just source code, not used in production [00:30:03] got it, thanks for that, I'll go ahead and sync that first just to keep everything in sync [00:30:16] going live now [00:31:26] !log thcipriani@deploy1002 Synchronized wmf-config/config/mediawikiwiki.yaml: Config: [[gerrit:761739|Make Vector 2022 the default skin for MediaWiki.org (T298519)]] (duration: 00m 48s) [00:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:31] T298519: Turn on desktop improvements on new set of pilot wikis - https://phabricator.wikimedia.org/T298519 [00:31:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [00:31:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [00:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:33:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [00:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:33:22] !log thcipriani@deploy1002 Synchronized dblists/desktop-improvements.dblist: Config: [[gerrit:761739|Make Vector 2022 the default skin for MediaWiki.org (T298519)]] (duration: 00m 48s) [00:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:33:29] ^ should be live now! [00:33:36] thcipriani: nice! thanks! 
[00:34:13] (03CR) 10Jdlrobson: [C: 04-1] "wmf-config/InitialiseSettings-labs.php will extend the values in wmf-config/InitialiseSettings.php (enwiki, wikipedia and default entries)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761772 (https://phabricator.wikimedia.org/T301166) (owner: 10Clare Ming) [00:38:38] !log utc late backport {{done}} [00:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:57] thanks for the deploy bwang ! [00:39:01] PROBLEM - ElasticSearch setting check - 9600 on elastic1075 is CRITICAL: CRITICAL - .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [00:39:31] (03PS2) 10Clare Ming: Update Beta cluster configuration for new and existing accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761772 (https://phabricator.wikimedia.org/T301166) [00:39:45] (03CR) 10Clare Ming: Update Beta cluster configuration for new and existing accounts (039 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761772 (https://phabricator.wikimedia.org/T301166) (owner: 10Clare Ming) [00:42:23] RECOVERY - Check systemd state on apifeatureusage1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:42:57] PROBLEM - ElasticSearch setting check - 9200 on elastic2042 is CRITICAL: CRITICAL - [elastic2038.codfw.wmnet:9500, elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500] does not match [elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500, elastic2052.codfw.wmnet:9500] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [00:43:51] PROBLEM - ElasticSearch setting check - 9600 on elastic1073 is CRITICAL: CRITICAL - .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [01:08:45] PROBLEM - ElasticSearch setting check - 9200 on elastic2031 is CRITICAL: CRITICAL - [elastic2038.codfw.wmnet:9500, elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500] does not 
match [elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500, elastic2052.codfw.wmnet:9500] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [01:13:37] PROBLEM - ElasticSearch setting check - 9200 on elastic2025 is CRITICAL: CRITICAL - [elastic2038.codfw.wmnet:9500, elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500] does not match [elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500, elastic2052.codfw.wmnet:9500] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [01:21:30] (03CR) 10Jdlrobson: [C: 03+1] Update Beta cluster configuration for new and existing accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761772 (https://phabricator.wikimedia.org/T301166) (owner: 10Clare Ming) [01:32:40] Sorry for the elasticsearch setting check noise - back near a computer in ~35 mins and will take a look [01:33:02] No action required currently, the cluster itself is fine [01:40:30] (JobUnavailable) firing: (3) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [01:50:21] PROBLEM - ElasticSearch setting check - 9600 on elastic1083 is CRITICAL: CRITICAL - .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [01:53:59] (03PS1) 10Andrew Bogott: profile::wmcs::nfsclient: rearrange the logic that decides what to mount [puppet] - 10https://gerrit.wikimedia.org/r/761783 [02:22:51] (03PS1) 10Andrew Bogott: profile::wmcs::nfsclient: Move server mount path into yaml config [puppet] - 10https://gerrit.wikimedia.org/r/761787 [02:25:47] (03PS2) 10Andrew Bogott: profile::wmcs::nfsclient: Move server mount path into yaml config [puppet] - 10https://gerrit.wikimedia.org/r/761787 [02:28:15] (03PS1) 10Ryan Kemper: Revert "elastic: fix cirrus settings check false negative" [puppet] - 10https://gerrit.wikimedia.org/r/761748 [02:28:28] 
(03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] Revert "elastic: fix cirrus settings check false negative" [puppet] - 10https://gerrit.wikimedia.org/r/761748 (owner: 10Ryan Kemper) [02:30:03] (03PS3) 10Andrew Bogott: profile::wmcs::nfsclient: Move server mount path into yaml config [puppet] - 10https://gerrit.wikimedia.org/r/761787 [02:34:57] (03PS4) 10Andrew Bogott: profile::wmcs::nfsclient: Move server mount path into yaml config [puppet] - 10https://gerrit.wikimedia.org/r/761787 [02:51:40] (03PS5) 10Andrew Bogott: profile::wmcs::nfsclient: Move server mount path into yaml config [puppet] - 10https://gerrit.wikimedia.org/r/761787 [02:52:26] (03CR) 10jerkins-bot: [V: 04-1] profile::wmcs::nfsclient: Move server mount path into yaml config [puppet] - 10https://gerrit.wikimedia.org/r/761787 (owner: 10Andrew Bogott) [02:59:18] (03PS6) 10Andrew Bogott: profile::wmcs::nfsclient: Move server mount path into yaml config [puppet] - 10https://gerrit.wikimedia.org/r/761787 [03:00:07] (03CR) 10jerkins-bot: [V: 04-1] profile::wmcs::nfsclient: Move server mount path into yaml config [puppet] - 10https://gerrit.wikimedia.org/r/761787 (owner: 10Andrew Bogott) [03:10:52] (03PS7) 10Andrew Bogott: profile::wmcs::nfsclient: Move server mount path into yaml config [puppet] - 10https://gerrit.wikimedia.org/r/761787 [03:11:38] (03CR) 10jerkins-bot: [V: 04-1] profile::wmcs::nfsclient: Move server mount path into yaml config [puppet] - 10https://gerrit.wikimedia.org/r/761787 (owner: 10Andrew Bogott) [04:03:20] (03PS8) 10Andrew Bogott: profile::wmcs::nfsclient: Move server mount path into yaml config [puppet] - 10https://gerrit.wikimedia.org/r/761787 [05:11:31] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3065 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [05:16:19] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in 
api_appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [05:21:06] (03CR) 10Santhosh: [C: 03+1] Enable ULS webfonts by default on trwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/694315 (https://phabricator.wikimedia.org/T283626) (owner: 10Superyetkin) [05:23:23] PROBLEM - SSH on wtp1027.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:37:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T300775)', diff saved to https://phabricator.wikimedia.org/P20595 and previous config saved to /var/cache/conftool/dbconfig/20220211-053752-marostegui.json [05:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:37:59] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [05:41:05] (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [05:42:31] RECOVERY - ElasticSearch setting check - 9200 on elastic2042 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [05:42:31] RECOVERY - ElasticSearch setting check - 9200 on elastic2031 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [05:42:31] RECOVERY - ElasticSearch setting check - 9200 on elastic2025 is OK: OK - All good! 
https://wikitech.wikimedia.org/wiki/Search%23Administration [05:42:54] (03CR) 10Marostegui: [C: 03+1] First sweep of clean up of tendril [puppet] - 10https://gerrit.wikimedia.org/r/761671 (https://phabricator.wikimedia.org/T297605) (owner: 10Ladsgroup) [05:46:44] (03PS9) 10Andrew Bogott: profile::wmcs::nfsclient: Move server mount path into yaml config [puppet] - 10https://gerrit.wikimedia.org/r/761787 [05:52:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P20596 and previous config saved to /var/cache/conftool/dbconfig/20220211-055256-marostegui.json [05:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:31] !log Remove watchdog@10.% user from s6 codfw T301442 [05:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:36] T301442: Audit and remove watchdog user - https://phabricator.wikimedia.org/T301442 [05:58:11] (03PS1) 10Marostegui: Revert "db2093: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/761749 [05:59:27] (03CR) 10Marostegui: [C: 03+2] Revert "db2093: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/761749 (owner: 10Marostegui) [06:03:50] (03PS10) 10Andrew Bogott: profile::wmcs::nfsclient: Move server mount path into yaml config [puppet] - 10https://gerrit.wikimedia.org/r/761787 [06:04:01] (03CR) 10Santhosh: [C: 03+1] Enable SectionTranslation in Occitan and Luganda [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761626 (https://phabricator.wikimedia.org/T301443) (owner: 10KartikMistry) [06:08:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P20597 and previous config saved to /var/cache/conftool/dbconfig/20220211-060801-marostegui.json [06:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:06] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T300775)', diff saved to https://phabricator.wikimedia.org/P20598 and previous config saved to /var/cache/conftool/dbconfig/20220211-062306-marostegui.json [06:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [06:23:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [06:23:13] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [06:23:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 8 hosts with reason: Maintenance [06:23:15] RECOVERY - SSH on wtp1027.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 8 hosts with reason: Maintenance [06:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:41] (03PS1) 10Majavah: admin: fix GenSysadminsTable.py [puppet] - 10https://gerrit.wikimedia.org/r/761789 [06:41:12] (03PS1) 10Marostegui: db_inventory.my.cnf.erb: innodb_adaptive_hash_index=OFF [puppet] - 10https://gerrit.wikimedia.org/r/761791 (https://phabricator.wikimedia.org/T268869) [06:43:24] (03CR) 10Marostegui: [C: 03+2] db_inventory.my.cnf.erb: innodb_adaptive_hash_index=OFF [puppet] - 10https://gerrit.wikimedia.org/r/761791 
(https://phabricator.wikimedia.org/T268869) (owner: 10Marostegui) [06:52:40] (03PS1) 10Marostegui: tendril.cnf.erb: Remove file [puppet] - 10https://gerrit.wikimedia.org/r/761793 (https://phabricator.wikimedia.org/T297605) [06:54:15] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/pcc-worker1001/33742/" [puppet] - 10https://gerrit.wikimedia.org/r/761793 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [07:28:43] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:29:13] PROBLEM - clamd running on otrs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [07:40:27] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:40:57] RECOVERY - clamd running on otrs1001 is OK: PROCS OK: 1 process with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [07:53:09] (03CR) 10Filippo Giunchedi: [C: 03+2] team-sre: catch WMCS Prometheus scrape failures [alerts] - 10https://gerrit.wikimedia.org/r/761604 (https://phabricator.wikimedia.org/T301376) (owner: 10Filippo Giunchedi) [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220211T0800) [08:26:23] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:33:34] (03CR) 10Muehlenhoff: [C: 03+1] "Thank you, merging!" 
[puppet] - 10https://gerrit.wikimedia.org/r/761789 (owner: 10Majavah) [08:33:36] (03CR) 10Muehlenhoff: [C: 03+2] admin: fix GenSysadminsTable.py [puppet] - 10https://gerrit.wikimedia.org/r/761789 (owner: 10Majavah) [08:36:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1011.eqiad.wmnet with OS buster [08:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:47] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1011.eqiad.wmnet with OS buster [08:39:51] (03PS1) 10Kevin Bazira: ml-services: remove editquality transformer [deployment-charts] - 10https://gerrit.wikimedia.org/r/761877 (https://phabricator.wikimedia.org/T301412) [08:57:26] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti1011.eqiad.wmnet with OS buster [08:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:34] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1011.eqiad.wmnet with OS buster executed with errors: - ganeti1011 (*... [08:57:47] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10MatthewVernon) @Cmjohnson thanks! Do you have an idea how long ms-fe1012 is going to be needed for testing, please? 
[09:00:57] (03PS1) 10Marostegui: change_wll_sender_T300662.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/761879 (https://phabricator.wikimedia.org/T300662) [09:02:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove watchlist group from s1 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P20599 and previous config saved to /var/cache/conftool/dbconfig/20220211-090223-marostegui.json [09:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:29] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 [09:13:00] (03CR) 10Elukey: [C: 03+2] ml-services: remove editquality transformer [deployment-charts] - 10https://gerrit.wikimedia.org/r/761877 (https://phabricator.wikimedia.org/T301412) (owner: 10Kevin Bazira) [09:27:41] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:29:10] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [09:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:49] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . 
[09:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:05] (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [09:42:33] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 391 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:44:53] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:05:15] (03PS1) 10Btullis: Remove the old AQS nodes from the aqs cluster [puppet] - 10https://gerrit.wikimedia.org/r/761884 (https://phabricator.wikimedia.org/T297803) [10:05:37] !log jelto@deploy1002 helmfile [staging] START helmfile.d/services/termbox: apply [10:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:54] !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/termbox: apply [10:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1021.eqiad.wmnet with OS buster [10:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:55] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1021.eqiad.wmnet with OS buster [10:16:29] (03PS3) 10Jelto: 
helmfiles: log helmfile deploy only once in SAL [deployment-charts] - 10https://gerrit.wikimedia.org/r/760524 [10:20:33] (03CR) 10Jelto: helmfiles: log helmfile deploy only once in SAL (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/760524 (owner: 10Jelto) [10:35:35] (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: increase CPU and memory allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/761677 (https://phabricator.wikimedia.org/T300914) (owner: 10Hnowlan) [10:36:07] (03PS1) 10Filippo Giunchedi: network: make slice_network_constants work outside of module [puppet] - 10https://gerrit.wikimedia.org/r/761888 (https://phabricator.wikimedia.org/T291946) [10:36:09] (03PS1) 10Filippo Giunchedi: hieradata: add probes for "pop services" [puppet] - 10https://gerrit.wikimedia.org/r/761889 (https://phabricator.wikimedia.org/T291946) [10:36:11] (03PS1) 10Filippo Giunchedi: prometheus: split probes by network sphere and address family [puppet] - 10https://gerrit.wikimedia.org/r/761890 (https://phabricator.wikimedia.org/T291946) [10:37:59] (03CR) 10jerkins-bot: [V: 04-1] network: make slice_network_constants work outside of module [puppet] - 10https://gerrit.wikimedia.org/r/761888 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:39:36] (03Merged) 10jenkins-bot: changeprop-jobqueue: increase CPU and memory allocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/761677 (https://phabricator.wikimedia.org/T300914) (owner: 10Hnowlan) [10:39:36] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:39:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1021.eqiad.wmnet with OS buster [10:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:55] 10SRE, 
10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1021.eqiad.wmnet with OS buster completed: - ganeti1021 (**PASS**)... [10:40:37] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync on production [10:40:39] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync on staging [10:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:05] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync on production [10:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:19] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync on production [10:42:21] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync on staging [10:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:25] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync on production [10:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:32] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Enable ULS webfonts by default on trwikisource (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/694315 (https://phabricator.wikimedia.org/T283626) (owner: 10Superyetkin) [10:42:44] (03PS2) 10Filippo Giunchedi: network: make slice_network_constants work outside of module [puppet] - 10https://gerrit.wikimedia.org/r/761888 (https://phabricator.wikimedia.org/T291946) [10:42:46] (03PS2) 10Filippo Giunchedi: 
hieradata: add probes for "pop services" [puppet] - 10https://gerrit.wikimedia.org/r/761889 (https://phabricator.wikimedia.org/T291946) [10:42:48] (03PS2) 10Filippo Giunchedi: prometheus: split probes by network sphere and address family [puppet] - 10https://gerrit.wikimedia.org/r/761890 (https://phabricator.wikimedia.org/T291946) [10:42:53] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync on production [10:42:54] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:42:55] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync on staging [10:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:16] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync on production [10:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:01] (03CR) 10jerkins-bot: [V: 04-1] prometheus: split probes by network sphere and address family [puppet] - 10https://gerrit.wikimedia.org/r/761890 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:48:11] 10SRE, 10Wikimedia-Etherpad, 10serviceops, 10vm-requests, 10Patch-For-Review: create bullseye VM for Etherpad upgrade (and upgrade it to 1.8.16) - https://phabricator.wikimedia.org/T300568 (10Volans) >>! In T300568#7702770, @Dzahn wrote: > I picked the last option and found out if you set the "ip" parame... 
[10:50:32] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/761890 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:53:53] (03PS1) 10Muehlenhoff: Remove auth2001 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/761895 (https://phabricator.wikimedia.org/T301546) [11:22:54] (03PS2) 10Arturo Borrero Gonzalez: toolforge: automated tests: introduce testcase for different executable paths [puppet] - 10https://gerrit.wikimedia.org/r/761684 (https://phabricator.wikimedia.org/T284767) [11:28:55] (03PS3) 10Arturo Borrero Gonzalez: toolforge: automated tests: introduce testcase for different executable paths [puppet] - 10https://gerrit.wikimedia.org/r/761684 (https://phabricator.wikimedia.org/T284767) [11:31:08] (03PS4) 10Arturo Borrero Gonzalez: toolforge: automated tests: introduce testcase for different executable paths [puppet] - 10https://gerrit.wikimedia.org/r/761684 (https://phabricator.wikimedia.org/T284767) [11:39:10] 10SRE, 10WMDE-GeoInfo-FocusArea, 10JavaScript, 10Maps (Kartographer): Display map markers on Kartographer maps even in case of mapserver failures - https://phabricator.wikimedia.org/T270865 (10thiemowmde) [11:40:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: automated tests: introduce testcase for different executable paths [puppet] - 10https://gerrit.wikimedia.org/r/761684 (https://phabricator.wikimedia.org/T284767) (owner: 10Arturo Borrero Gonzalez) [11:49:46] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Improve Netbox import script to avoid port-number collisions in JunOS - https://phabricator.wikimedia.org/T301392 (10cmooney) Ok have done some testing on netbox-next with the improved script (manually edited on netbox-dev2001 to test). ##... 
[12:13:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1016.eqiad.wmnet with OS buster [12:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:57] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1016.eqiad.wmnet with OS buster [12:17:08] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Improve Netbox import script to avoid port-number collisions in JunOS - https://phabricator.wikimedia.org/T301392 (10cmooney) #### Provision Server Test ###### Test Setup Firstly I created a dummy test server set to status 'planned' for... [12:38:35] there is a ganeti host, that seems to be swapping, if someone can have a look later: ganeti1022 [12:38:40] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Improve Netbox import script to avoid port-number collisions in JunOS - https://phabricator.wikimedia.org/T301392 (10cmooney) #### Move Server Within Row Test Lastly we need to test the behaviour if a device is moved from one cab to anothe... [12:41:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1016.eqiad.wmnet with OS buster [12:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:36] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1016.eqiad.wmnet with OS buster completed: - ganeti1016 (**PASS**)... 
[12:48:55] (03PS4) 10Cathal Mooney: Cleanup new interface creation and add logic to remove orphan ints [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392) [12:49:37] (03CR) 10jerkins-bot: [V: 04-1] Cleanup new interface creation and add logic to remove orphan ints [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392) (owner: 10Cathal Mooney) [12:53:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1011.eqiad.wmnet with OS buster [12:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:59] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1011.eqiad.wmnet with OS buster [13:00:12] (03PS5) 10Cathal Mooney: Cleanup new interface creation and add logic to remove orphan ints [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392) [13:00:48] (03CR) 10jerkins-bot: [V: 04-1] Cleanup new interface creation and add logic to remove orphan ints [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392) (owner: 10Cathal Mooney) [13:01:15] (03PS3) 10Filippo Giunchedi: prometheus: split probes by network sphere and address family [puppet] - 10https://gerrit.wikimedia.org/r/761890 (https://phabricator.wikimedia.org/T291946) [13:01:17] (03PS1) 10Filippo Giunchedi: role: reorder prometheus profile inclusion [puppet] - 10https://gerrit.wikimedia.org/r/761910 (https://phabricator.wikimedia.org/T291946) [13:02:52] (03CR) 10jerkins-bot: [V: 04-1] role: reorder prometheus profile inclusion [puppet] - 10https://gerrit.wikimedia.org/r/761910 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) 
[13:03:07] RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:04:51] (03PS2) 10Filippo Giunchedi: role: reorder prometheus profile inclusion [puppet] - 10https://gerrit.wikimedia.org/r/761910 (https://phabricator.wikimedia.org/T291946) [13:04:53] (03PS4) 10Filippo Giunchedi: prometheus: split probes by network sphere and address family [puppet] - 10https://gerrit.wikimedia.org/r/761890 (https://phabricator.wikimedia.org/T291946) [13:05:49] (03PS1) 10Jbond: P:puppet_compiler: Add proxy site for cumin connections [puppet] - 10https://gerrit.wikimedia.org/r/761911 [13:05:54] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33750/console" [puppet] - 10https://gerrit.wikimedia.org/r/761890 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:07:16] (03PS3) 10Filippo Giunchedi: hieradata: add probes for "pop services" [puppet] - 10https://gerrit.wikimedia.org/r/761889 (https://phabricator.wikimedia.org/T291946) [13:07:18] (03PS3) 10Filippo Giunchedi: role: reorder prometheus profile inclusion [puppet] - 10https://gerrit.wikimedia.org/r/761910 (https://phabricator.wikimedia.org/T291946) [13:07:20] (03PS5) 10Filippo Giunchedi: prometheus: split probes by network sphere and address family [puppet] - 10https://gerrit.wikimedia.org/r/761890 (https://phabricator.wikimedia.org/T291946) [13:09:30] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33753/console" [puppet] - 10https://gerrit.wikimedia.org/r/761911 (owner: 10Jbond) [13:09:53] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:puppet_compiler: Add proxy site for cumin connections [puppet] - 10https://gerrit.wikimedia.org/r/761911 (owner: 10Jbond) [13:10:13] (03PS4) 10Filippo Giunchedi: hieradata: add probes for "pop services" [puppet] 
- 10https://gerrit.wikimedia.org/r/761889 (https://phabricator.wikimedia.org/T291946) [13:10:15] (03PS4) 10Filippo Giunchedi: role: reorder prometheus profile inclusion [puppet] - 10https://gerrit.wikimedia.org/r/761910 (https://phabricator.wikimedia.org/T291946) [13:10:18] (03PS6) 10Filippo Giunchedi: prometheus: split probes by network sphere and address family [puppet] - 10https://gerrit.wikimedia.org/r/761890 (https://phabricator.wikimedia.org/T291946) [13:13:35] (03PS6) 10Cathal Mooney: Cleanup new interface creation and add logic to remove orphan ints [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392) [13:15:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [13:15:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [13:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T300662)', diff saved to https://phabricator.wikimedia.org/P20605 and previous config saved to /var/cache/conftool/dbconfig/20220211-131507-marostegui.json [13:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:14] T300662: Make wikilove_log.wll_sender/wll_receiver unsigned on wmf wikis - https://phabricator.wikimedia.org/T300662 [13:17:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [13:17:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [13:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [13:18:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [13:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:33] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1011.eqiad.wmnet with OS buster [13:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:39] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1011.eqiad.wmnet with OS buster executed with errors: - ganeti1011 (*... 
[13:19:46] (03PS2) 10Marostegui: change_wll_sender_T300662.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/761879 (https://phabricator.wikimedia.org/T300662) [13:20:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 10%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20606 and previous config saved to /var/cache/conftool/dbconfig/20220211-132028-root.json [13:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:05] (03CR) 10Marostegui: "This has been tested with db1098:3316" [software/schema-changes] - 10https://gerrit.wikimedia.org/r/761879 (https://phabricator.wikimedia.org/T300662) (owner: 10Marostegui) [13:33:14] 10SRE, 10ops-eqiad, 10DC-Ops: Broken disk on ganeti1011 - https://phabricator.wikimedia.org/T301240 (10MoritzMuehlenhoff) I tried to reinstall again and after running a few tests with fdisk (d-i only has a very limited tool set provided by busybox) it turns out there's an additional disk showing up on that s... [13:35:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 25%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20607 and previous config saved to /var/cache/conftool/dbconfig/20220211-133533-root.json [13:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:50] (03PS1) 10Matthias Mullie: [WikibaseMediaInfo] Make synonyms profile the default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761915 (https://phabricator.wikimedia.org/T301559) [13:40:56] (03CR) 10Ladsgroup: [C: 03+1] "Thanks!" 
[software/schema-changes] - 10https://gerrit.wikimedia.org/r/761879 (https://phabricator.wikimedia.org/T300662) (owner: 10Marostegui) [13:41:05] (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [13:41:37] (03PS1) 10Jelto: aptrepo::files::updates Update gitlab-ce and gitlab-runner to 14.7 [puppet] - 10https://gerrit.wikimedia.org/r/761917 (https://phabricator.wikimedia.org/T300967) [13:42:22] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/761917 (https://phabricator.wikimedia.org/T300967) (owner: 10Jelto) [13:45:10] (03PS3) 10Marostegui: change_wll_sender_T300662.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/761879 (https://phabricator.wikimedia.org/T300662) [13:46:04] (03CR) 10Marostegui: [V: 03+2 C: 03+2] change_wll_sender_T300662.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/761879 (https://phabricator.wikimedia.org/T300662) (owner: 10Marostegui) [13:49:52] (03CR) 10Ladsgroup: [C: 03+1] tendril.cnf.erb: Remove file [puppet] - 10https://gerrit.wikimedia.org/r/761793 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [13:50:05] (03PS2) 10Ladsgroup: First sweep of clean up of tendril [puppet] - 10https://gerrit.wikimedia.org/r/761671 (https://phabricator.wikimedia.org/T297605) [13:50:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 50%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20608 and previous config saved to /var/cache/conftool/dbconfig/20220211-135037-root.json [13:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:32] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] First sweep of clean up of tendril [puppet] - 
10https://gerrit.wikimedia.org/r/761671 (https://phabricator.wikimedia.org/T297605) (owner: 10Ladsgroup) [13:52:53] (03CR) 10Muehlenhoff: "odules/install_server/files/dhcpd/linux-host-entries.ttyS0-115200#b161" [puppet] - 10https://gerrit.wikimedia.org/r/761671 (https://phabricator.wikimedia.org/T297605) (owner: 10Ladsgroup) [13:55:25] (03CR) 10Ladsgroup: First sweep of clean up of tendril (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/761671 (https://phabricator.wikimedia.org/T297605) (owner: 10Ladsgroup) [13:55:48] (03CR) 10Marostegui: [C: 03+2] tendril.cnf.erb: Remove file [puppet] - 10https://gerrit.wikimedia.org/r/761793 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [13:57:02] (03PS1) 10Ladsgroup: install_server: Drop dbmonitor [puppet] - 10https://gerrit.wikimedia.org/r/761918 (https://phabricator.wikimedia.org/T297605) [13:57:53] (03CR) 10Marostegui: [C: 03+1] install_server: Drop dbmonitor [puppet] - 10https://gerrit.wikimedia.org/r/761918 (https://phabricator.wikimedia.org/T297605) (owner: 10Ladsgroup) [13:59:34] (03CR) 10Muehlenhoff: [C: 03+1] install_server: Drop dbmonitor [puppet] - 10https://gerrit.wikimedia.org/r/761918 (https://phabricator.wikimedia.org/T297605) (owner: 10Ladsgroup) [14:02:27] (03PS1) 10Ladsgroup: Second sweep of tendril clean up [puppet] - 10https://gerrit.wikimedia.org/r/761919 (https://phabricator.wikimedia.org/T297605) [14:02:44] (03PS1) 10Marostegui: tendril/maintenance.pp: Remove file [puppet] - 10https://gerrit.wikimedia.org/r/761920 (https://phabricator.wikimedia.org/T297605) [14:04:20] (03PS2) 10Ladsgroup: install_server: Drop dbmonitor [puppet] - 10https://gerrit.wikimedia.org/r/761918 (https://phabricator.wikimedia.org/T297605) [14:04:23] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] install_server: Drop dbmonitor [puppet] - 10https://gerrit.wikimedia.org/r/761918 (https://phabricator.wikimedia.org/T297605) (owner: 10Ladsgroup) [14:04:47] (03PS2) 10Ladsgroup: Second sweep of tendril clean up [puppet] 
- 10https://gerrit.wikimedia.org/r/761919 (https://phabricator.wikimedia.org/T297605) [14:05:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 75%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20609 and previous config saved to /var/cache/conftool/dbconfig/20220211-140540-root.json [14:05:42] (03PS7) 10Cathal Mooney: Cleanup new interface creation and add logic to remove orphan ints [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/761598 (https://phabricator.wikimedia.org/T301392) [14:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:56] (03CR) 10Ladsgroup: [C: 03+1] tendril/maintenance.pp: Remove file [puppet] - 10https://gerrit.wikimedia.org/r/761920 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [14:06:09] (03CR) 10Marostegui: [C: 03+2] tendril/maintenance.pp: Remove file [puppet] - 10https://gerrit.wikimedia.org/r/761920 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [14:06:44] (03CR) 10Marostegui: [C: 03+1] Second sweep of tendril clean up [puppet] - 10https://gerrit.wikimedia.org/r/761919 (https://phabricator.wikimedia.org/T297605) (owner: 10Ladsgroup) [14:07:12] (03PS3) 10Ladsgroup: Second sweep of tendril clean up [puppet] - 10https://gerrit.wikimedia.org/r/761919 (https://phabricator.wikimedia.org/T297605) [14:07:15] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Second sweep of tendril clean up [puppet] - 10https://gerrit.wikimedia.org/r/761919 (https://phabricator.wikimedia.org/T297605) (owner: 10Ladsgroup) [14:07:47] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts auth2001.codfw.wmnet [14:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:55] (03PS2) 10Muehlenhoff: Remove auth2001 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/761895 (https://phabricator.wikimedia.org/T301546) [14:10:52] 10SRE, 10Infrastructure-Foundations, 10netops, 
10Patch-For-Review: Improve Netbox import script to avoid port-number collisions in JunOS - https://phabricator.wikimedia.org/T301392 (10cmooney) Riccardo noticed an issue on the provision server and move server tests. In both cases the "created interface" act... [14:12:44] (03CR) 10Muehlenhoff: [C: 03+2] Remove auth2001 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/761895 (https://phabricator.wikimedia.org/T301546) (owner: 10Muehlenhoff) [14:15:25] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10akosiaris) >>! In T300977#7700987, @Ottomata wrote: > Hahah, maybe what we should do is excludelist the internal domains i... [14:15:49] (03PS1) 10Ladsgroup: Third sweep of tendril clean up [puppet] - 10https://gerrit.wikimedia.org/r/761921 (https://phabricator.wikimedia.org/T297605) [14:19:29] (03CR) 10Ladsgroup: "PCC https://puppet-compiler.wmflabs.org/pcc-worker1003/33754/db1115.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/761921 (https://phabricator.wikimedia.org/T297605) (owner: 10Ladsgroup) [14:20:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3316 (re)pooling @ 100%: repooling after schema change', diff saved to https://phabricator.wikimedia.org/P20610 and previous config saved to /var/cache/conftool/dbconfig/20220211-142045-root.json [14:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:22] (03CR) 10Marostegui: "Once this is merged, run puppet on db1115 and db2093 to make sure nothing broke there. 
I don't recall if they still use anything from the " [puppet] - 10https://gerrit.wikimedia.org/r/761921 (https://phabricator.wikimedia.org/T297605) (owner: 10Ladsgroup) [14:21:26] (03CR) 10Marostegui: [C: 03+1] Third sweep of tendril clean up [puppet] - 10https://gerrit.wikimedia.org/r/761921 (https://phabricator.wikimedia.org/T297605) (owner: 10Ladsgroup) [14:22:45] (03CR) 10Ladsgroup: Third sweep of tendril clean up (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/761921 (https://phabricator.wikimedia.org/T297605) (owner: 10Ladsgroup) [14:22:49] (03PS2) 10Ladsgroup: Third sweep of tendril clean up [puppet] - 10https://gerrit.wikimedia.org/r/761921 (https://phabricator.wikimedia.org/T297605) [14:22:52] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Third sweep of tendril clean up [puppet] - 10https://gerrit.wikimedia.org/r/761921 (https://phabricator.wikimedia.org/T297605) (owner: 10Ladsgroup) [14:23:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts auth2001.codfw.wmnet [14:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:42] (03CR) 10Elukey: [C: 03+1] Remove the old AQS nodes from the aqs cluster [puppet] - 10https://gerrit.wikimedia.org/r/761884 (https://phabricator.wikimedia.org/T297803) (owner: 10Btullis) [14:27:48] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb={PATCH,POST} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:27:54] 10SRE, 10ops-eqiad: Allocate new cabs for WMCS in rows E/F Eqiad - https://phabricator.wikimedia.org/T301414 (10cmooney) 05Open→03Resolved a:03cmooney @Jclark-ctr updated the other ticket to say we'd go with E4/F4 for these. >>! In T301419#7701383, @Jclark-ctr wrote: > updated task for new cage Rack E4,... 
[14:27:56] 10ops-codfw, 10decommission-hardware, 10Patch-For-Review: decommission auth2001.codfw.wmnet - https://phabricator.wikimedia.org/T301546 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03Papaul [14:30:05] 10SRE, 10ops-eqiad: 8 x SMF Patches between cages Eqiad - LVS & WMCS - https://phabricator.wikimedia.org/T301419 (10cmooney) [14:30:08] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:33:18] (03PS1) 10Vgutierrez: haproxy::tls_terminator: Support IPv[46] backends [puppet] - 10https://gerrit.wikimedia.org/r/761926 (https://phabricator.wikimedia.org/T290005) [14:33:56] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10akosiaris) >>! In T300977#7701074, @mpopov wrote: > A couple of questions/comments: > >>>! In T300977#7700842, @jbond wro... [14:34:32] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33755/console" [puppet] - 10https://gerrit.wikimedia.org/r/761926 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [14:35:35] (03PS1) 10Ladsgroup: mariadb: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/761927 (https://phabricator.wikimedia.org/T297605) [14:38:55] (03CR) 10Marostegui: [C: 03+1] mariadb: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/761927 (https://phabricator.wikimedia.org/T297605) (owner: 10Ladsgroup) [14:39:16] (03CR) 10Ladsgroup: "clouddb2001-dev.codfw.wmnet is on stretch as well. Contacted wmcs people. I don't know if it uses this module though." 
[puppet] - 10https://gerrit.wikimedia.org/r/761927 (https://phabricator.wikimedia.org/T297605) (owner: 10Ladsgroup) [14:40:26] (03PS1) 10Vgutierrez: cache::haproxy: Switch from UNIX sockets to TCP on cp4031 [puppet] - 10https://gerrit.wikimedia.org/r/761928 (https://phabricator.wikimedia.org/T290005) [14:42:06] (03PS2) 10Vgutierrez: cache::haproxy: Switch from UNIX sockets to TCP on cp4032 [puppet] - 10https://gerrit.wikimedia.org/r/761928 (https://phabricator.wikimedia.org/T290005) [14:43:14] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33756/console" [puppet] - 10https://gerrit.wikimedia.org/r/761928 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [14:47:43] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] haproxy::tls_terminator: Support IPv[46] backends [puppet] - 10https://gerrit.wikimedia.org/r/761926 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [14:53:02] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache::haproxy: Switch from UNIX sockets to TCP on cp4032 [puppet] - 10https://gerrit.wikimedia.org/r/761928 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [14:56:09] 10SRE, 10SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (10akosiaris) >>! In T298723#7683885, @jcrespo wrote: >>>! In T298723#7683849, @AndyRussG wrote: > >> Ah ok thanks for taking care with this... Can you describe any more details abou... 
[15:00:10] (03CR) 10Alexandros Kosiaris: [C: 03+1] network: make slice_network_constants work outside of module [puppet] - 10https://gerrit.wikimedia.org/r/761888 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [15:00:52] (03PS1) 10Majavah: O:mariadb::cloudinfra: update comments and motd role [puppet] - 10https://gerrit.wikimedia.org/r/761929 [15:02:45] 10SRE, 10vm-requests: eqiad: 1 VM requested for karapace - https://phabricator.wikimedia.org/T301563 (10BTullis) [15:03:05] 10SRE, 10vm-requests: eqiad: 1 VM requested for karapace - https://phabricator.wikimedia.org/T301563 (10BTullis) [15:06:22] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Hive - set hive.warehouse.subdir.inherit.perms = false [puppet] - 10https://gerrit.wikimedia.org/r/734368 (https://phabricator.wikimedia.org/T291664) (owner: 10Ottomata) [15:12:25] 10SRE, 10vm-requests: eqiad: 1 VM requested for karapace - https://phabricator.wikimedia.org/T301563 (10MoritzMuehlenhoff) Looks good, please pick row_A for this one, it has the most capacity ATM. 
[15:32:39] (03PS1) 10Majavah: hieradata: update wmcs ntp servers after bullseye updates [puppet] - 10https://gerrit.wikimedia.org/r/761933 [15:46:11] (03PS2) 10KartikMistry: Enable SectionTranslation in Occitan and Luganda [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761626 (https://phabricator.wikimedia.org/T301443) [15:51:03] (03PS1) 10KartikMistry: Fixed typo for SectionTranslation in testwiki: lu -> lg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761937 [15:54:40] (03PS1) 10Hnowlan: changeprop-jobqueue: decrease CPU use, increase number of nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/761938 (https://phabricator.wikimedia.org/T300914) [15:55:09] (03CR) 10Ppchelko: [C: 03+1] changeprop-jobqueue: decrease CPU use, increase number of nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/761938 (https://phabricator.wikimedia.org/T300914) (owner: 10Hnowlan) [15:56:35] (03PS4) 10Krinkle: Start writing to $wmgConfigDir the same value as to $wmfConfigDir [mediawiki-config] - 10https://gerrit.wikimedia.org/r/760570 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [15:59:49] (03CR) 10Cparle: [C: 03+1] [WikibaseMediaInfo] Make synonyms profile the default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761915 (https://phabricator.wikimedia.org/T301559) (owner: 10Matthias Mullie) [16:03:44] !log btullis@cumin1001 START - Cookbook sre.ganeti.makevm for new host datahubsearch1001.eqiad.wmnet [16:03:45] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Product-Analytics, 10User-Ladsgroup: Requesting access to Superset for AUgolnikova - https://phabricator.wikimedia.org/T300878 (10Ladsgroup) Sorry for that, you needed to be added to wmf ldap group as well which is done now. Can you try and let me k... [16:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:02] (03CR) 10Krinkle: [C: 04-1] "It seems most of the others are local variables as well. 
They are not global and don't need an unused second one in the same function:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/760570 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [16:04:05] 10SRE, 10vm-requests: eqiad: 3 VMs requested for datahub opensearch cluster - https://phabricator.wikimedia.org/T301383 (10BTullis) I'm creating the first of these three now. I've added `datahubsearch` as a new server prefix here: https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_naming_conventions#Server... [16:10:27] (03CR) 10Cwhite: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/759783 (https://phabricator.wikimedia.org/T299168) (owner: 10Cwhite) [16:11:10] (03CR) 10Krinkle: [C: 04-2] "So yeah, none of these are meant to be config globals. This isn't using "wmf" as global variable. This is a local variable to represent th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/760570 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [16:12:50] (03CR) 10Filippo Giunchedi: [C: 03+1] hiera: map logstash.wm.o to kibana7.eqiad [puppet] - 10https://gerrit.wikimedia.org/r/759783 (https://phabricator.wikimedia.org/T299168) (owner: 10Cwhite) [16:12:56] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Product-Analytics, 10User-Ladsgroup: Requesting access to Superset for AUgolnikova - https://phabricator.wikimedia.org/T300878 (10AUgolnikova-WMF) Thank you, it works now! 
[16:13:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance [16:13:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance [16:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T300775)', diff saved to https://phabricator.wikimedia.org/P20611 and previous config saved to /var/cache/conftool/dbconfig/20220211-161324-marostegui.json [16:13:25] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Product-Analytics, 10User-Ladsgroup: Requesting access to Superset for AUgolnikova - https://phabricator.wikimedia.org/T300878 (10AUgolnikova-WMF) 05Open→03Resolved [16:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:30] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [16:13:58] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/761889 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [16:14:26] (03CR) 10Cwhite: [C: 03+1] role: reorder prometheus profile inclusion [puppet] - 10https://gerrit.wikimedia.org/r/761910 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [16:20:50] (03CR) 10Herron: [C: 03+1] hiera: map logstash.wm.o to kibana7.eqiad [puppet] - 10https://gerrit.wikimedia.org/r/759783 (https://phabricator.wikimedia.org/T299168) (owner: 10Cwhite) [16:24:38] (03CR) 10Cwhite: [C: 03+1] "There's a lot going on here in one changeset, but it makes sense considering the diff." 
[puppet] - 10https://gerrit.wikimedia.org/r/761890 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [16:32:29] !log btullis@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host datahubsearch1001.eqiad.wmnet [16:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:52] 10SRE, 10vm-requests: eqiad: 3 VMs requested for datahub opensearch cluster - https://phabricator.wikimedia.org/T301383 (10BTullis) That all went as planned. ` END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host datahubsearch1001.eqiad.wmnet ` @razzi - can you carry this out for the other two pl... [16:36:02] (03PS1) 10Ladsgroup: ncredir: Update foundation.wm.o redirects to the new website [puppet] - 10https://gerrit.wikimedia.org/r/761947 [16:37:28] (03PS1) 10Elukey: nagios_common: add quotes around check_ores_workers' HOSTADDRESS [puppet] - 10https://gerrit.wikimedia.org/r/761948 [16:39:40] (03CR) 10Vgutierrez: "looks good, do we have a task ID for this commit?" 
[puppet] - 10https://gerrit.wikimedia.org/r/761947 (owner: 10Ladsgroup) [16:42:14] (03CR) 10Ladsgroup: ncredir: Update foundation.wm.o redirects to the new website (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/761947 (owner: 10Ladsgroup) [16:43:03] (03CR) 10Elukey: [C: 03+2] nagios_common: add quotes around check_ores_workers' HOSTADDRESS [puppet] - 10https://gerrit.wikimedia.org/r/761948 (owner: 10Elukey) [16:45:54] (03CR) 10JMeybohm: [C: 03+1] "That should be fine" [deployment-charts] - 10https://gerrit.wikimedia.org/r/761938 (https://phabricator.wikimedia.org/T300914) (owner: 10Hnowlan) [16:48:51] (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: decrease CPU use, increase number of nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/761938 (https://phabricator.wikimedia.org/T300914) (owner: 10Hnowlan) [16:52:50] (03Merged) 10jenkins-bot: changeprop-jobqueue: decrease CPU use, increase number of nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/761938 (https://phabricator.wikimedia.org/T300914) (owner: 10Hnowlan) [16:53:15] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync on production [16:53:17] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync on staging [16:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:45] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync on production [16:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:04] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync on production [16:54:05] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync on staging [16:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:11] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:27] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync on production [16:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:28] (PS1) Elukey: profile::icinga::ircbot: send ores-related alarms to #wikimedia-ml [puppet] - https://gerrit.wikimedia.org/r/761949 [16:58:04] (PS1) AOkoth: admin: extend daniram access by 6 months [puppet] - https://gerrit.wikimedia.org/r/761950 [16:58:30] (CR) Herron: [C: +1] "LGTM! Out of curiosity what do you envision the distinction between e.g. 'probe/public' and 'probe/private' being used for in the future?" [puppet] - https://gerrit.wikimedia.org/r/761890 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi) [16:59:19] (CR) Herron: [C: +1] role: reorder prometheus profile inclusion [puppet] - https://gerrit.wikimedia.org/r/761910 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi) [17:01:00] (CR) Herron: [C: +1] hieradata: add probes for "pop services" [puppet] - https://gerrit.wikimedia.org/r/761889 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi) [17:03:44] SRE, Domains, wikimediafoundation.org: Update wikipediafoundation.org redirect - https://phabricator.wikimedia.org/T301574 (Varnent) [17:04:28] (PS2) Ladsgroup: ncredir: Update foundation.wm.o redirects to the new website [puppet] - https://gerrit.wikimedia.org/r/761947 (https://phabricator.wikimedia.org/T301574) [17:05:33] SRE, Domains, wikimediafoundation.org, Patch-For-Review: Update wikipediafoundation.org redirect - https://phabricator.wikimedia.org/T301574 (Varnent) [17:08:15] (CR) Vgutierrez: [C: +1] ncredir: Update foundation.wm.o redirects to the new website [puppet] - https://gerrit.wikimedia.org/r/761947 (https://phabricator.wikimedia.org/T301574) (owner: Ladsgroup) [17:09:04]
(PS3) Ladsgroup: ncredir: Update foundation.wm.o redirects to the new website [puppet] - https://gerrit.wikimedia.org/r/761947 (https://phabricator.wikimedia.org/T301574) [17:09:09] (CR) Ladsgroup: [V: +2 C: +2] ncredir: Update foundation.wm.o redirects to the new website [puppet] - https://gerrit.wikimedia.org/r/761947 (https://phabricator.wikimedia.org/T301574) (owner: Ladsgroup) [17:10:32] (CR) Andrew Bogott: [C: +2] profile::wmcs::nfsclient: rearrange the logic that decides what to mount [puppet] - https://gerrit.wikimedia.org/r/761783 (owner: Andrew Bogott) [17:10:37] SRE, Domains, wikimediafoundation.org, User-Ladsgroup: Update wikipediafoundation.org redirect - https://phabricator.wikimedia.org/T301574 (Ladsgroup) Open→Resolved a:Ladsgroup It will be there in half an hour (plus some layers of cache) [17:11:52] (CR) Herron: rsyslog: add 00-load_modules.conf (2 comments) [puppet] - https://gerrit.wikimedia.org/r/761455 (https://phabricator.wikimedia.org/T292175) (owner: Herron) [17:13:27] (CR) Herron: [C: +1] profile::icinga::ircbot: send ores-related alarms to #wikimedia-ml [puppet] - https://gerrit.wikimedia.org/r/761949 (owner: Elukey) [17:13:37] (CR) Elukey: [C: +2] profile::icinga::ircbot: send ores-related alarms to #wikimedia-ml [puppet] - https://gerrit.wikimedia.org/r/761949 (owner: Elukey) [17:22:29] (CR) Cwhite: [C: +1] "Since this introducing a layout pattern that should be followed in the future, a linter that catches module imports elsewhere in configs i" [puppet] - https://gerrit.wikimedia.org/r/761455 (https://phabricator.wikimedia.org/T292175) (owner: Herron) [17:22:45] SRE, SRE-Access-Requests: Subscribe Zabe to ops@ - https://phabricator.wikimedia.org/T301011 (Arnoldokoth) Open→In progress [17:29:28] (CR) Andrew Bogott: [C: +2] hieradata: update wmcs ntp servers after bullseye updates [puppet] - https://gerrit.wikimedia.org/r/761933 (owner:
Majavah) [17:30:24] (CR) Andrew Bogott: [C: +2] O:mariadb::cloudinfra: update comments and motd role [puppet] - https://gerrit.wikimedia.org/r/761929 (owner: Majavah) [17:34:23] (CR) RLazarus: [C: +1] admin: extend daniram access by 6 months [puppet] - https://gerrit.wikimedia.org/r/761950 (owner: AOkoth) [17:36:40] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (wiki_willy) ++ @cmooney, who might be able to provide an answer on that. I think he's wrapping things up with testing though, so maybe about another week?... [17:39:50] (CR) Andrew Bogott: [C: +2] profile::wmcs::nfsclient: Move server mount path into yaml config [puppet] - https://gerrit.wikimedia.org/r/761787 (owner: Andrew Bogott) [17:39:58] (PS11) Andrew Bogott: profile::wmcs::nfsclient: Move server mount path into yaml config [puppet] - https://gerrit.wikimedia.org/r/761787 [17:40:08] SRE, SRE-Access-Requests: Requesting access to Superset/Turnilo for Kinneretgordon - https://phabricator.wikimedia.org/T301098 (Arnoldokoth) Open→In progress [17:41:05] (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [17:42:13] SRE, SRE-Access-Requests: Requesting access to Superset/Turnilo for Kinneretgordon - https://phabricator.wikimedia.org/T301098 (Arnoldokoth) Hi, we have investigated this further and have found that you have two accounts one with a contractor email and the other with a full-time employee email. @Kinneret...
[17:51:14] (CR) JMeybohm: [C: -1] k8s: add module (6 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/761297 (https://phabricator.wikimedia.org/T300879) (owner: Giuseppe Lavagetto) [17:58:12] SRE, SRE-Access-Requests: saisuman ssh production public keys reused for WMCS - https://phabricator.wikimedia.org/T300708 (Dzahn) Open→In progress [17:58:25] SRE, SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (Dzahn) Open→In progress [17:58:38] (CR) AOkoth: [C: +2] admin: extend daniram access by 6 months [puppet] - https://gerrit.wikimedia.org/r/761950 (owner: AOkoth) [18:07:16] (PS1) Krinkle: wmf-config: Use __DIR__ instead of re-using an unintended global [mediawiki-config] - https://gerrit.wikimedia.org/r/761963 (https://phabricator.wikimedia.org/T45956) [18:07:18] (PS1) Krinkle: [Beta Cluster] use require_once instead of include for import.php [mediawiki-config] - https://gerrit.wikimedia.org/r/761964 [18:07:20] (PS1) Krinkle: wmf-config: Use __DIR__ instead of "$IP/../wmf-config" [mediawiki-config] - https://gerrit.wikimedia.org/r/761965 (https://phabricator.wikimedia.org/T45956) [18:07:48] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (cmooney) >>! In T294137#7703274, @MatthewVernon wrote: > @Cmjohnson thanks! Do you have an idea how long ms-fe1012 is going to be needed for testing, please?... [18:09:26] (PS2) Krinkle: wmf-config: Use __DIR__ instead of "$IP/../wmf-config" [mediawiki-config] - https://gerrit.wikimedia.org/r/761965 (https://phabricator.wikimedia.org/T45956) [18:15:18] (CR) Ahmon Dancy: [C: +1] "LGTM. Thanks for the heads up. I'll need to do some fixing up when merging into the train-dev branch."
[mediawiki-config] - https://gerrit.wikimedia.org/r/761965 (https://phabricator.wikimedia.org/T45956) (owner: Krinkle) [18:25:55] SRE, SRE-Access-Requests: Subscribe Zabe to ops@ - https://phabricator.wikimedia.org/T301011 (Urbanecm) While adding you likely won't do any harm, I'm wondering about the need to access here. Most of the mails sent to the ops list are useful only for people with some kind of shell access and in my experi... [18:45:55] >upstream connect error or disconnect/reset before headers. reset reason: overflow [18:46:05] confirm [18:46:29] looking [18:46:42] Second incident this week. :-/ [18:48:27] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:48:47] SRE, User-Ladsgroup, Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (4nn1l2) Resolved→Open Not resolved yet. Saw it again at 18:45, 11 February 2022 when trying to save an edit to https://comm... [18:49:04] SRE, User-Ladsgroup, Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (NguoiDungKhongDinhDanh) I encountered it again just now. [18:49:12] SRE, User-Ladsgroup, Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (NguoiDungKhongDinhDanh) I encountered it again just now.
[18:50:43] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:54:04] (PS1) Andrew Bogott: labstore nfs-exportd.py: Don't require public exports [puppet] - https://gerrit.wikimedia.org/r/761972 (https://phabricator.wikimedia.org/T301280) [18:57:20] (CR) Andrew Bogott: [C: +2] labstore nfs-exportd.py: Don't require public exports [puppet] - https://gerrit.wikimedia.org/r/761972 (https://phabricator.wikimedia.org/T301280) (owner: Andrew Bogott) [18:59:11] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:59:55] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [19:02:11] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [19:04:56] (PS1) Andrew Bogott: nfs-mounts: move testlabs to a project-local nfs server [puppet] - https://gerrit.wikimedia.org/r/761973 (https://phabricator.wikimedia.org/T301280) [19:05:49] (PS1) Halfak: Adds aspell-hi to ores/manifests/base.pp [puppet] - https://gerrit.wikimedia.org/r/761974 (https://phabricator.wikimedia.org/T300195) [19:07:39] (CR) Dzahn: Regularly check MFA status of elevated Phabricator accounts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/759359 (https://phabricator.wikimedia.org/T299403) (owner: Aklapper) [19:08:26] (CR) Dzahn: [C: +2]
"Description-en: Hindi (hi) dictionary for GNU aspell" [puppet] - https://gerrit.wikimedia.org/r/761974 (https://phabricator.wikimedia.org/T300195) (owner: Halfak) [19:13:59] !log running puppet on all ores machines to install aspell-hi (gerrit:761974) which for some reason was installed on a random subset of ores servers (1002,2001,2005 but not the other 19 ones) T300195 T252581 - after this the package is now installed on 18 servers (1001-1009, 2001-2009) [19:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:07] T300195: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 [19:14:07] T252581: Train and test editquality models for Hindi Wikipedia - https://phabricator.wikimedia.org/T252581 [19:14:14] SRE, SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (AndyRussG) Thanks so so much once again, @Dzahn, @jcrespo, @akosiaris, @faidon, hugely appreciated!!!! As before, please don't hesitate to reach out if I can help in any way!
:) :) [19:14:41] (PS1) CDanis: text-lb: normalize unused query param [puppet] - https://gerrit.wikimedia.org/r/761979 [19:18:59] (PS2) CDanis: text-lb: normalize unused query param [puppet] - https://gerrit.wikimedia.org/r/761979 (https://phabricator.wikimedia.org/T301507) [19:19:05] Puppet, SRE, Infrastructure-Foundations, Patch-For-Review, User-jbond: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (Dzahn) [19:19:09] SRE, Data-Engineering: Trash cleanup cron spams on an-test hosts - https://phabricator.wikimedia.org/T286442 (Dzahn) [19:19:10] SRE, Patch-For-Review, Tracking-Neverending: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (Dzahn) [19:19:13] Puppet, SRE, Cloud-Services, Infrastructure-Foundations, and 2 others: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (Dzahn) [19:19:27] SRE, Discovery, Infrastructure-Foundations, netops: Speed up network connections for Elastic hosts - https://phabricator.wikimedia.org/T301577 (EBernhardson) Note also that there is a limit set inside elasticsearch, `cluster.indices.recovery.max_bytes_per_sec`. It's set quite low at the moment as...
[19:19:29] (CR) Ladsgroup: [C: -1] text-lb: normalize unused query param (1 comment) [puppet] - https://gerrit.wikimedia.org/r/761979 (https://phabricator.wikimedia.org/T301507) (owner: CDanis) [19:19:39] (CR) Andrew Bogott: [C: +2] nfs-mounts: move testlabs to a project-local nfs server [puppet] - https://gerrit.wikimedia.org/r/761973 (https://phabricator.wikimedia.org/T301280) (owner: Andrew Bogott) [19:20:33] (PS3) CDanis: text-lb: normalize unused query param [puppet] - https://gerrit.wikimedia.org/r/761979 (https://phabricator.wikimedia.org/T301507) [19:20:35] (CR) CDanis: text-lb: normalize unused query param (1 comment) [puppet] - https://gerrit.wikimedia.org/r/761979 (https://phabricator.wikimedia.org/T301507) (owner: CDanis) [19:22:27] (CR) Ladsgroup: [C: +1] text-lb: normalize unused query param [puppet] - https://gerrit.wikimedia.org/r/761979 (https://phabricator.wikimedia.org/T301507) (owner: CDanis) [19:22:33] Puppet, SRE, Cloud-Services, Infrastructure-Foundations, and 2 others: Create a cron to clean clientbucket every day or hour - https://phabricator.wikimedia.org/T165885 (Dzahn) @Paladox Since it's now possible to opt-in to this and then get a timer (see T165885#6585808 if you still want to try i...
[19:23:23] (PS2) Dzahn: Regularly check MFA status of elevated Phabricator accounts [puppet] - https://gerrit.wikimedia.org/r/759359 (https://phabricator.wikimedia.org/T299403) (owner: Aklapper) [19:25:26] (CR) Dzahn: [V: +1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/33760/phab1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/759359 (https://phabricator.wikimedia.org/T299403) (owner: Aklapper) [19:35:48] (CR) Dzahn: [V: +1 C: +2] Regularly check MFA status of elevated Phabricator accounts [puppet] - https://gerrit.wikimedia.org/r/759359 (https://phabricator.wikimedia.org/T299403) (owner: Aklapper) [19:35:55] SRE-Access-Requests: Requesting access to analytics-privatedata for Skye Berghel - https://phabricator.wikimedia.org/T301581 (skyenet) [19:36:04] (PS3) Dzahn: phabricator: Regularly check MFA status of elevated Phabricator accounts [puppet] - https://gerrit.wikimedia.org/r/759359 (https://phabricator.wikimedia.org/T299403) (owner: Aklapper) [19:41:27] (CR) Dzahn: "@Aklapper this is deployed on the phab server and everything worked..except you don't have the needed DB grants yet, after all:" [puppet] - https://gerrit.wikimedia.org/r/759359 (https://phabricator.wikimedia.org/T299403) (owner: Aklapper) [19:41:43] !log removed 16 emails from accounts with deleteUserEmail.php [19:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:06] SRE, SRE-Access-Requests: Subscribe Zabe to ops@ - https://phabricator.wikimedia.org/T301011 (herron) Hi @Zabe, please use the form here for subscription to ops@ https://lists.wikimedia.org/postorius/lists/ops.lists.wikimedia.org/ >>! In T301011#7704550, @Urbanecm wrote: > While adding you likely won't... [19:47:34] SRE, SRE-Access-Requests: Subscribe Zabe to ops@ - https://phabricator.wikimedia.org/T301011 (Urbanecm) >>! In T301011#7704712, @herron wrote: > [...] >>>!
In T301011#7704550, @Urbanecm wrote: >> While adding you likely won't do any harm, I'm wondering about the need to access here. Most of the mails sen... [19:47:44] (CR) Aklapper: "Thanks a lot! Yepp, that's why I wrote "This depends on https://gerrit.wikimedia.org/r/c/operations/puppet/+/759357 " earlier" [puppet] - https://gerrit.wikimedia.org/r/759359 (https://phabricator.wikimedia.org/T299403) (owner: Aklapper) [19:51:34] (CR) CDanis: [C: +2] text-lb: normalize unused query param [puppet] - https://gerrit.wikimedia.org/r/761979 (https://phabricator.wikimedia.org/T301507) (owner: CDanis) [19:51:57] (CR) Dzahn: "Oh, I kind of missed that. I thought we still wanted to just see if we even need the grant change or not. But good either way, I think. ch" [puppet] - https://gerrit.wikimedia.org/r/759359 (https://phabricator.wikimedia.org/T299403) (owner: Aklapper) [19:54:16] SRE, SRE-Access-Requests: Subscribe Zabe to ops@ - https://phabricator.wikimedia.org/T301011 (Dzahn) I can confirm Zabe is on the NDA list and the only rule for being on ops list seems to be .. to have NDA. I wasn't sure if I should recommend using the form, thanks for confirming that is indeed the way.
[19:56:10] (PS1) Andrew Bogott: profile::wmcs::nfs::standalone: keep the nfs service running [puppet] - https://gerrit.wikimedia.org/r/761981 (https://phabricator.wikimedia.org/T291405) [19:59:17] (PS1) CDanis: vcl tests: fix run.py script [puppet] - https://gerrit.wikimedia.org/r/761983 [20:00:23] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:05:04] (PS2) Andrew Bogott: profile::wmcs::nfs::standalone: keep the nfs service running [puppet] - https://gerrit.wikimedia.org/r/761981 (https://phabricator.wikimedia.org/T291405) [20:05:26] (CR) RLazarus: [C: +1] vcl tests: fix run.py script [puppet] - https://gerrit.wikimedia.org/r/761983 (owner: CDanis) [20:06:04] (CR) CDanis: [C: +2] vcl tests: fix run.py script [puppet] - https://gerrit.wikimedia.org/r/761983 (owner: CDanis) [20:11:42] SRE, SRE-Access-Requests: Subscribe Zabe to ops@ - https://phabricator.wikimedia.org/T301011 (Zabe) >>! In T301011#7704712, @herron wrote: > Hi @Zabe, please use the form here for subscription to ops@ https://lists.wikimedia.org/postorius/lists/ops.lists.wikimedia.org/ > >>>! In T301011#7704550, @Urbane... [20:13:35] SRE, SRE-Access-Requests: Subscribe Zabe to ops@ - https://phabricator.wikimedia.org/T301011 (herron) In progress→Resolved a:herron >>! In T301011#7704769, @Zabe wrote: > I have used the form and apparently someone already approved my request.
That was me, welcome to the list :) [20:18:20] (CR) Andrew Bogott: [C: +2] profile::wmcs::nfs::standalone: keep the nfs service running [puppet] - https://gerrit.wikimedia.org/r/761981 (https://phabricator.wikimedia.org/T291405) (owner: Andrew Bogott) [20:23:09] (PS1) Dzahn: blubber: copy config for 15.wikipedia into container [container/miscweb] - https://gerrit.wikimedia.org/r/761987 (https://phabricator.wikimedia.org/T300171) [20:25:53] (PS1) Cwhite: beta-logs: disable shard rebalancing for duration of curator run [puppet] - https://gerrit.wikimedia.org/r/761988 [20:25:55] (PS1) Cwhite: logstash: disable shard rebalancing for duration of curator run [puppet] - https://gerrit.wikimedia.org/r/761989 [20:27:11] (CR) jerkins-bot: [V: -1] blubber: copy config for 15.wikipedia into container [container/miscweb] - https://gerrit.wikimedia.org/r/761987 (https://phabricator.wikimedia.org/T300171) (owner: Dzahn) [20:43:40] (PS1) Andrew Bogott: nfs-mounts: move quarry to a project-local nfs server [puppet] - https://gerrit.wikimedia.org/r/761991 [20:46:00] (PS2) Cwhite: beta-logs: disable shard rebalancing while deleting indices [puppet] - https://gerrit.wikimedia.org/r/761988 [20:46:02] (PS2) Cwhite: logstash: disable shard rebalancing while deleting indices [puppet] - https://gerrit.wikimedia.org/r/761989 [20:47:13] (PS4) Herron: remove references to centrallog2001 [puppet] - https://gerrit.wikimedia.org/r/754029 (https://phabricator.wikimedia.org/T298994) [20:50:45] RECOVERY - NFS Share Volume Space /srv/tools on labstore1004 is OK: DISK OK - free space: /srv/tools 1663850 MB (21% inode=74%): https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1 [20:52:23] (CR) Cwhite: "Interested in your thoughts on this approach."
[puppet] - https://gerrit.wikimedia.org/r/761989 (owner: Cwhite) [21:07:03] (CR) Andrew Bogott: [C: +2] nfs-mounts: move quarry to a project-local nfs server [puppet] - https://gerrit.wikimedia.org/r/761991 (owner: Andrew Bogott) [21:12:02] (PS2) Dzahn: blubber: copy config for 15.wikipedia into container [container/miscweb] - https://gerrit.wikimedia.org/r/761987 (https://phabricator.wikimedia.org/T300171) [21:14:31] (CR) jerkins-bot: [V: -1] blubber: copy config for 15.wikipedia into container [container/miscweb] - https://gerrit.wikimedia.org/r/761987 (https://phabricator.wikimedia.org/T300171) (owner: Dzahn) [21:21:32] (PS3) Dzahn: blubber: copy config for 15.wikipedia into container [container/miscweb] - https://gerrit.wikimedia.org/r/761987 (https://phabricator.wikimedia.org/T300171) [21:35:26] (CR) Herron: logstash: disable shard rebalancing while deleting indices (1 comment) [puppet] - https://gerrit.wikimedia.org/r/761989 (owner: Cwhite) [21:38:39] (CR) Dzahn: [C: +2] blubber: copy config for 15.wikipedia into container [container/miscweb] - https://gerrit.wikimedia.org/r/761987 (https://phabricator.wikimedia.org/T300171) (owner: Dzahn) [21:41:05] (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [21:42:44] (Merged) jenkins-bot: blubber: copy config for 15.wikipedia into container [container/miscweb] - https://gerrit.wikimedia.org/r/761987 (https://phabricator.wikimedia.org/T300171) (owner: Dzahn) [21:45:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [21:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on
pinkunicorn [21:46:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [21:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:05] (CR) Dzahn: "please add me back when it's ready" [puppet] - https://gerrit.wikimedia.org/r/756685 (owner: Hashar) [21:47:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [21:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:54] (CR) Dzahn: contint: Install docker 20.10 from thirdparty/ci on buster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/758987 (https://phabricator.wikimedia.org/T300682) (owner: Dduvall) [22:00:11] (PS1) Dzahn: miscweb: bump staging to 2022-02-11-214428-production [deployment-charts] - https://gerrit.wikimedia.org/r/762000 [22:02:37] (PS2) Dzahn: miscweb: bump staging to 2022-02-11-214428-production [deployment-charts] - https://gerrit.wikimedia.org/r/762000 [22:02:43] (CR) Dzahn: [C: +2] miscweb: bump staging to 2022-02-11-214428-production [deployment-charts] - https://gerrit.wikimedia.org/r/762000 (owner: Dzahn) [22:05:39] (CR) Dzahn: [C: +1] aptrepo::files::updates Update gitlab-ce and gitlab-runner to 14.7 [puppet] - https://gerrit.wikimedia.org/r/761917 (https://phabricator.wikimedia.org/T300967) (owner: Jelto) [22:08:55] (Merged) jenkins-bot: miscweb: bump staging to 2022-02-11-214428-production [deployment-charts] - https://gerrit.wikimedia.org/r/762000 (owner: Dzahn) [22:09:58] !log dzahn@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply on main [22:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:24] !log dzahn@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: sync on main [22:20:28] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [22:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [22:29:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [22:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [22:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:49] !log dzahn@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply on main [22:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:06] (PS1) Bking: Elastic: add deployment-prep cert [puppet] - https://gerrit.wikimedia.org/r/762006 (https://phabricator.wikimedia.org/T299797) [22:46:20] (CR) Dzahn: "this certificate has more than just deployment-elastic11 on it.
it has all these:" [puppet] - https://gerrit.wikimedia.org/r/762006 (https://phabricator.wikimedia.org/T299797) (owner: Bking) [22:47:14] !log dzahn@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: sync on main [22:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:06] (PS5) Ebernhardson: query_service: Simplify jvm arg handling [puppet] - https://gerrit.wikimedia.org/r/761080 [22:51:35] (CR) Bking: Elastic: add deployment-prep cert (1 comment) [puppet] - https://gerrit.wikimedia.org/r/762006 (https://phabricator.wikimedia.org/T299797) (owner: Bking) [23:01:57] (CR) Dzahn: Elastic: add deployment-prep cert (1 comment) [puppet] - https://gerrit.wikimedia.org/r/762006 (https://phabricator.wikimedia.org/T299797) (owner: Bking) [23:03:17] (CR) Dzahn: "besides that nitpick the cert looks good by the way, i checked the SANs on it. so consider it +1" [puppet] - https://gerrit.wikimedia.org/r/762006 (https://phabricator.wikimedia.org/T299797) (owner: Bking) [23:12:36] (PS2) Bking: Elastic: add deployment-prep cert [puppet] - https://gerrit.wikimedia.org/r/762006 (https://phabricator.wikimedia.org/T299797) [23:13:13] (CR) jerkins-bot: [V: -1] Elastic: add deployment-prep cert [puppet] - https://gerrit.wikimedia.org/r/762006 (https://phabricator.wikimedia.org/T299797) (owner: Bking) [23:14:08] (CR) Ebernhardson: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/761080 (owner: Ebernhardson) [23:18:00] (PS3) Bking: Elastic: add deployment-prep cert [puppet] - https://gerrit.wikimedia.org/r/762006 (https://phabricator.wikimedia.org/T299797) [23:18:51] (PS4) Dzahn: Elastic: add deployment-prep cert [puppet] - https://gerrit.wikimedia.org/r/762006 (https://phabricator.wikimedia.org/T299797) (owner: Bking) [23:19:15] (CR) Dzahn: [C: +1] "lgtm! fixed the jerkins-bot -1 for you, I think."
[puppet] - https://gerrit.wikimedia.org/r/762006 (https://phabricator.wikimedia.org/T299797) (owner: Bking) [23:22:04] (CR) Bking: [C: +2] Elastic: add deployment-prep cert [puppet] - https://gerrit.wikimedia.org/r/762006 (https://phabricator.wikimedia.org/T299797) (owner: Bking) [23:23:58] !log puppet-merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/762006 [23:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:10] inflatador: by the way, it was cool that you posted that question on wikitech-l too, have a good one