[00:26:00] SRE, LDAP-Access-Requests: Grant Access to Logstash for Zabe - https://phabricator.wikimedia.org/T297323 (KFrancis) @Dzahn The NDA is fully executed. Please proceed with the access request. Happy Holidays to you too!
[00:50:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:52:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:59:17] SRE, LDAP-Access-Requests: Grant Access to Logstash for Zabe - https://phabricator.wikimedia.org/T297323 (Dzahn) a:KFrancis→Dzahn
[01:02:14] SRE, LDAP-Access-Requests: Grant Access to Logstash for Zabe - https://phabricator.wikimedia.org/T297323 (Dzahn) Thank you @KFrancis that was so quick. @Zabe Just missed you on IRC, let's chat tomorrow how to continue. I was wondering how to deal with real name in the ops repo etc.
[01:15:25] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[01:50:43] PROBLEM - SSH on kubernetes1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:38:15] !log restarted zuul on contint2001, was totally stuck.
(T298177)
[02:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:38:22] T298177: CI is doing nada (Gearman) - https://phabricator.wikimedia.org/T298177
[02:40:43] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[02:51:49] RECOVERY - SSH on kubernetes1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:27:53] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:53:23] (CR) Thcipriani: "recheck" [puppet] - https://gerrit.wikimedia.org/r/361796 (owner: Thcipriani)
[04:04:56] SRE, docker-pkg, serviceops, Patch-For-Review, Release Pipeline (Blubber): Container image lifecycle management - https://phabricator.wikimedia.org/T287130 (RLazarus) Status update, leaving this for the end-of-year break: [[ https://gerrit.wikimedia.org/r/748876 | 748876 ]] and [[ https://ger...
[04:29:07] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:44:29] (PS1) Subramanya Sastry: Enable slow-parsoid logs [mediawiki-config] - https://gerrit.wikimedia.org/r/749302
[05:46:35] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1370.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:41:17] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:51:33] SRE, docker-pkg, serviceops, Patch-For-Review, Release Pipeline (Blubber): Container image lifecycle management - https://phabricator.wikimedia.org/T287130 (Joe) >>! In T287130#7585276, @RLazarus wrote: > @joe: My recollection is you were going to take care of the blubber and docker-pkg part...
[07:10:55] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:15:17] (PS3) Giuseppe Lavagetto: role::builder: only rebuild images on deneb [puppet] - https://gerrit.wikimedia.org/r/749209
[07:18:40] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33071/console" [puppet] - https://gerrit.wikimedia.org/r/749209 (owner: Giuseppe Lavagetto)
[07:22:11] (CR) Giuseppe Lavagetto: [V: +1 C: +2] role::builder: only rebuild images on deneb [puppet] - https://gerrit.wikimedia.org/r/749209 (owner: Giuseppe Lavagetto)
[07:42:23] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:39:01] (CR) Giuseppe Lavagetto: [C: +2] Fix helmfile.d service examples [deployment-charts] - https://gerrit.wikimedia.org/r/749147 (owner: Giuseppe Lavagetto)
[08:39:29] (CR) Giuseppe Lavagetto: [C: +2] Fix helmfile.d service examples (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/749147 (owner: Giuseppe Lavagetto)
[08:42:37] (Merged) jenkins-bot: Fix helmfile.d service examples [deployment-charts] - https://gerrit.wikimedia.org/r/749147 (owner: Giuseppe Lavagetto)
[08:49:12] (PS1) Giuseppe Lavagetto: docker::engine: raise a warning if overlayfs is not enabled [puppet] - https://gerrit.wikimedia.org/r/749506
[08:56:20] (CR) Jelto: "sorry for comment after merge but I think Release.Namespace and Release.Name have to be escaped like" [deployment-charts] - https://gerrit.wikimedia.org/r/749147 (owner: Giuseppe Lavagetto)
[09:04:23] SRE, ops-codfw, DC-Ops, serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (Joe)
a:Joe→None
[09:05:44] (PS1) Giuseppe Lavagetto: deployment_server: add docker engine [puppet] - https://gerrit.wikimedia.org/r/749508 (https://phabricator.wikimedia.org/T297673)
[09:08:48] (PS2) Giuseppe Lavagetto: docker::engine: raise a warning if overlayfs is not enabled [puppet] - https://gerrit.wikimedia.org/r/749506
[09:08:50] (PS2) Giuseppe Lavagetto: deployment_server: add docker engine [puppet] - https://gerrit.wikimedia.org/r/749508 (https://phabricator.wikimedia.org/T297673)
[09:09:46] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33073/console" [puppet] - https://gerrit.wikimedia.org/r/749508 (https://phabricator.wikimedia.org/T297673) (owner: Giuseppe Lavagetto)
[09:11:28] (PS3) Giuseppe Lavagetto: deployment_server: add docker engine [puppet] - https://gerrit.wikimedia.org/r/749508 (https://phabricator.wikimedia.org/T297673)
[09:25:29] (PS1) Jelto: ssh-config: add config for gitlab.wikimedia.org [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/749509
[10:02:12] (CR) Kormat: Add MySQL upgrade cookbook (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/749176 (https://phabricator.wikimedia.org/T239814) (owner: Ladsgroup)
[10:24:17] (PS6) Arturo Borrero Gonzalez: wmcs: toolforge: grid: add cookbook to depool a node [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/749221 (https://phabricator.wikimedia.org/T277653)
[10:24:19] (PS6) Arturo Borrero Gonzalez: wmcs: toolforge: grid: add remove instance cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/749222 (https://phabricator.wikimedia.org/T277653)
[10:27:08] (CR) jerkins-bot: [V: -1] wmcs: toolforge: grid: add remove instance cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/749222 (https://phabricator.wikimedia.org/T277653) (owner: Arturo Borrero Gonzalez)
[10:31:01] (CR) Jbond: Add MySQL upgrade cookbook
(9 comments) [cookbooks] - https://gerrit.wikimedia.org/r/749176 (https://phabricator.wikimedia.org/T239814) (owner: Ladsgroup)
[10:31:34] (PS7) Arturo Borrero Gonzalez: wmcs: toolforge: grid: add remove instance cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/749222 (https://phabricator.wikimedia.org/T277653)
[10:34:24] (CR) jerkins-bot: [V: -1] wmcs: toolforge: grid: add remove instance cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/749222 (https://phabricator.wikimedia.org/T277653) (owner: Arturo Borrero Gonzalez)
[10:37:03] (PS8) Arturo Borrero Gonzalez: wmcs: toolforge: grid: add remove instance cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/749222 (https://phabricator.wikimedia.org/T277653)
[10:42:03] (PS2) Kormat: wmfdb/mycnf: Add support for parsing my.cnf files. [software/wmfdb] - https://gerrit.wikimedia.org/r/749213
[10:43:43] (PS5) Jbond: Add MySQL upgrade cookbook [cookbooks] - https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814)
[10:44:29] (PS6) Jbond: Add MySQL upgrade cookbook [cookbooks] - https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814)
[10:46:05] (PS7) Jbond: Add MySQL upgrade cookbook [cookbooks] - https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814)
[10:48:27] (CR) Jbond: [C: -1] "im going to -1 this as i don't think it is useful for the current use case as the puppet run will probably inter fear with things.
howeve" [cookbooks] - https://gerrit.wikimedia.org/r/749176 (https://phabricator.wikimedia.org/T239814) (owner: Ladsgroup)
[10:48:33] (CR) jerkins-bot: [V: -1] Add MySQL upgrade cookbook [cookbooks] - https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814) (owner: Jbond)
[10:55:43] (CR) Jcrespo: "If this is intended for wmf production, this looks really unsafe (it would break multi instance hosts, doesn't check for mysql to be stopp" [cookbooks] - https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814) (owner: Jbond)
[11:06:04] (CR) Marostegui: Add MySQL upgrade cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814) (owner: Jbond)
[11:09:37] (CR) JMeybohm: "Just some nits" [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/749509 (owner: Jelto)
[11:10:38] (CR) Jcrespo: Add MySQL upgrade cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814) (owner: Jbond)
[11:12:46] (CR) JMeybohm: [C: +1] imagecatalog: Add an hourly systemd timer to scan for what's currently running [puppet] - https://gerrit.wikimedia.org/r/748876 (https://phabricator.wikimedia.org/T287130) (owner: RLazarus)
[11:13:56] (CR) Jbond: "FYI the base class implements most the nits i have added and i think that with minimal changes (some way to stop the puppet run) to the we" [cookbooks] - https://gerrit.wikimedia.org/r/749176 (https://phabricator.wikimedia.org/T239814) (owner: Ladsgroup)
[11:14:08] (PS7) Arturo Borrero Gonzalez: wmcs: toolforge: grid: add cookbook to depool a node [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/749221 (https://phabricator.wikimedia.org/T277653)
[11:14:10] (PS9) Arturo Borrero Gonzalez: wmcs: toolforge: grid: add remove instance cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/749222
(https://phabricator.wikimedia.org/T277653)
[11:14:13] (PS1) Arturo Borrero Gonzalez: wmcs: openstack API: introduce server_exists() function and use it [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/749519
[11:15:24] (CR) Jbond: Add MySQL upgrade cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/749176 (https://phabricator.wikimedia.org/T239814) (owner: Ladsgroup)
[11:15:44] (CR) JMeybohm: [C: +1] Return a set, not a list, from active_images() [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/748873 (owner: RLazarus)
[11:17:07] (CR) jerkins-bot: [V: -1] wmcs: toolforge: grid: add cookbook to depool a node [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/749221 (https://phabricator.wikimedia.org/T277653) (owner: Arturo Borrero Gonzalez)
[11:17:09] (CR) jerkins-bot: [V: -1] wmcs: toolforge: grid: add remove instance cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/749222 (https://phabricator.wikimedia.org/T277653) (owner: Arturo Borrero Gonzalez)
[11:20:04] (PS8) Arturo Borrero Gonzalez: wmcs: toolforge: grid: add cookbook to depool a node [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/749221 (https://phabricator.wikimedia.org/T277653)
[11:20:05] (PS10) Arturo Borrero Gonzalez: wmcs: toolforge: grid: add remove instance cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/749222 (https://phabricator.wikimedia.org/T277653)
[11:23:35] (PS1) Kormat: setup.cfg: Run mypy against unit tests too. [software/wmfdb] - https://gerrit.wikimedia.org/r/749522
[11:25:39] Puppet, SRE, Beta-Cluster-Infrastructure, Infrastructure-Foundations: Setup cron for foreachwikiindblist all-labs.dblist extensions/AbuseFilter/maintenance/purgeOldLogIPData.php on Beta - https://phabricator.wikimedia.org/T187658 (Majavah) Open→Resolved This was done some point as `mediaw...
[11:25:53] (PS2) Kormat: setup.cfg: Run mypy against unit tests too.
[software/wmfdb] - https://gerrit.wikimedia.org/r/749522
[11:28:01] (CR) Kormat: [C: +2] setup.cfg: Run mypy against unit tests too. [software/wmfdb] - https://gerrit.wikimedia.org/r/749522 (owner: Kormat)
[11:29:43] (Merged) jenkins-bot: setup.cfg: Run mypy against unit tests too. [software/wmfdb] - https://gerrit.wikimedia.org/r/749522 (owner: Kormat)
[11:33:48] (PS4) Giuseppe Lavagetto: deployment_server: add docker engine [puppet] - https://gerrit.wikimedia.org/r/749508 (https://phabricator.wikimedia.org/T297673)
[11:33:50] (PS1) Giuseppe Lavagetto: profile::base: introduce class memory_cgroup [puppet] - https://gerrit.wikimedia.org/r/749523
[11:33:52] (PS11) Arturo Borrero Gonzalez: wmcs: toolforge: grid: add remove instance cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/749222 (https://phabricator.wikimedia.org/T277653)
[11:33:54] (PS1) Arturo Borrero Gonzalez: wmcs: vps: remove_instance: introduce new argument to disable dologmsg [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/749524
[11:35:07] (PS3) Kormat: wmfdb/mycnf: Add support for parsing my.cnf files. [software/wmfdb] - https://gerrit.wikimedia.org/r/749213
[11:37:24] (PS1) Kormat: wmfdb/mysql_cli: Add docstrings. [software/wmfdb] - https://gerrit.wikimedia.org/r/749525
[11:38:21] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33076/console" [puppet] - https://gerrit.wikimedia.org/r/749523 (owner: Giuseppe Lavagetto)
[11:39:12] (CR) Kormat: [C: +2] wmfdb/mysql_cli: Add docstrings. [software/wmfdb] - https://gerrit.wikimedia.org/r/749525 (owner: Kormat)
[11:39:47] (PS1) Arturo Borrero Gonzalez: sonofgridengine: grid_configurator: be less aggressive with removing dead config [puppet] - https://gerrit.wikimedia.org/r/749526
[11:40:37] (Merged) jenkins-bot: wmfdb/mysql_cli: Add docstrings.
[software/wmfdb] - https://gerrit.wikimedia.org/r/749525 (owner: Kormat)
[11:41:10] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs: openstack API: introduce server_exists() function and use it [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/749519 (owner: Arturo Borrero Gonzalez)
[11:41:16] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs: toolforge: grid: add cookbook to depool a node [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/749221 (https://phabricator.wikimedia.org/T277653) (owner: Arturo Borrero Gonzalez)
[11:41:22] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs: vps: remove_instance: introduce new argument to disable dologmsg [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/749524 (owner: Arturo Borrero Gonzalez)
[11:41:31] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs: toolforge: grid: add remove instance cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/749222 (https://phabricator.wikimedia.org/T277653) (owner: Arturo Borrero Gonzalez)
[11:41:36] (CR) Arturo Borrero Gonzalez: [C: +2] sonofgridengine: grid_configurator: be less aggressive with removing dead config [puppet] - https://gerrit.wikimedia.org/r/749526 (owner: Arturo Borrero Gonzalez)
[11:41:53] (PS8) Jbond: Add MySQL upgrade cookbook [cookbooks] - https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814)
[11:41:55] (PS1) Jbond: sre:SREBatchBase: Wrap everything in icinga downtime [cookbooks] - https://gerrit.wikimedia.org/r/749527
[11:44:43] (CR) jerkins-bot: [V: -1] Add MySQL upgrade cookbook [cookbooks] - https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814) (owner: Jbond)
[11:44:45] (CR) jerkins-bot: [V: -1] sre:SREBatchBase: Wrap everything in icinga downtime [cookbooks] - https://gerrit.wikimedia.org/r/749527 (owner: Jbond)
[11:46:41] (CR) Volans: "I did a pass, looks good to me in general, I have left some general question/comment inline and
I think there is a case not covered." [software/wmfdb] - https://gerrit.wikimedia.org/r/749213 (owner: Kormat)
[11:47:17] (PS9) Jbond: Add MySQL upgrade cookbook [cookbooks] - https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814)
[11:49:20] (PS2) Jbond: sre:SREBatchBase: Wrap everything in icinga downtime [cookbooks] - https://gerrit.wikimedia.org/r/749527
[11:49:46] (CR) jerkins-bot: [V: -1] Add MySQL upgrade cookbook [cookbooks] - https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814) (owner: Jbond)
[11:49:51] (PS10) Jbond: Add MySQL upgrade cookbook [cookbooks] - https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814)
[11:50:09] SRE, LDAP-Access-Requests: Grant Access to Logstash for Zabe - https://phabricator.wikimedia.org/T297323 (Urbanecm) >>! In T297323#7585099, @Dzahn wrote: > Thank you @KFrancis that was so quick. > > @Zabe Just missed you on IRC, let's chat tomorrow how to continue. I was wondering how to deal with real...
[11:55:45] (PS3) Jbond: sre:SREBatchBase: Wrap everything in icinga downtime [cookbooks] - https://gerrit.wikimedia.org/r/749527
[11:56:02] (PS11) Jbond: Add MySQL upgrade cookbook [cookbooks] - https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814)
[11:57:49] (CR) Jbond: "With the update to the base class in the preceding patch i think this is now equivalent to 749176 with the additional benefits that the ba" [cookbooks] - https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814) (owner: Jbond)
[11:58:21] (CR) jerkins-bot: [V: -1] sre:SREBatchBase: Wrap everything in icinga downtime [cookbooks] - https://gerrit.wikimedia.org/r/749527 (owner: Jbond)
[11:58:40] (CR) jerkins-bot: [V: -1] Add MySQL upgrade cookbook [cookbooks] - https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814) (owner: Jbond)
[11:59:03] (CR) Volans: "quick pass" [cookbooks] - https://gerrit.wikimedia.org/r/749527 (owner: Jbond)
[12:00:48] (CR) Jbond: Add MySQL upgrade cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814) (owner: Jbond)
[12:01:09] (PS1) Arturo Borrero Gonzalez: wmcs: toolforge: grid: add reconfigure cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/749531
[12:01:24] (PS4) Jbond: sre:SREBatchBase: Wrap everything in icinga downtime [cookbooks] - https://gerrit.wikimedia.org/r/749527
[12:01:27] (PS3) Giuseppe Lavagetto: mwdebug: improve helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/749148
[12:01:30] (PS12) Jbond: Add MySQL upgrade cookbook [cookbooks] - https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814)
[12:01:32] (PS4) Kormat: wmfdb/mycnf: Add support for parsing my.cnf files.
[software/wmfdb] - https://gerrit.wikimedia.org/r/749213
[12:02:02] (CR) Jbond: Add MySQL upgrade cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814) (owner: Jbond)
[12:02:57] (CR) jerkins-bot: [V: -1] mwdebug: improve helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/749148 (owner: Giuseppe Lavagetto)
[12:04:24] (CR) Kormat: wmfdb/mycnf: Add support for parsing my.cnf files. (5 comments) [software/wmfdb] - https://gerrit.wikimedia.org/r/749213 (owner: Kormat)
[12:08:09] (PS2) Arturo Borrero Gonzalez: wmcs: toolforge: grid: add reconfigure cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/749531
[12:08:47] SRE, LDAP-Access-Requests: Grant Access to Logstash for Zabe - https://phabricator.wikimedia.org/T297323 (Dzahn) Yes, that is exactly what I meant. It's an option if you (Zabe) want to keep your real name out of the admin.yaml.
[12:10:38] (PS3) Arturo Borrero Gonzalez: wmcs: toolforge: grid: add reconfigure cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/749531
[12:11:21] SRE, LDAP-Access-Requests: Grant Access to Logstash for Zabe - https://phabricator.wikimedia.org/T297323 (Zabe) I have no problem with my real name showing up in data.yaml. It is visible in my git commits anyway.
[12:13:40] (PS4) Giuseppe Lavagetto: mwdebug: improve helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/749148
[12:14:40] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs: toolforge: grid: add reconfigure cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/749531 (owner: Arturo Borrero Gonzalez)
[12:15:26] (PS5) Jbond: sre:SREBatchBase: Wrap everything in icinga downtime [cookbooks] - https://gerrit.wikimedia.org/r/749527
[12:16:17] (PS1) Dzahn: admin: add Zabe to ldap_only_admins for volunteer NDA group [puppet] - https://gerrit.wikimedia.org/r/749534 (https://phabricator.wikimedia.org/T297323)
[12:17:19] (CR) Dzahn: [C: +2] admin: add Zabe to ldap_only_admins for volunteer NDA group [puppet] - https://gerrit.wikimedia.org/r/749534 (https://phabricator.wikimedia.org/T297323) (owner: Dzahn)
[12:31:14] (CR) Jcrespo: ":-)" [cookbooks] - https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814) (owner: Jbond)
[12:31:29] !log LDAP added uid=zabe to group nda (T297323)
[12:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:35] T297323: Grant Access to Logstash for Zabe - https://phabricator.wikimedia.org/T297323
[12:33:24] (CR) Volans: "Very first pass for early feedback as requested" [software/spicerack] - https://gerrit.wikimedia.org/r/747116 (owner: Jbond)
[12:33:33] (PS6) Jbond: sre:SREBatchBase: Wrap everything in icinga downtime [cookbooks] - https://gerrit.wikimedia.org/r/749527
[12:34:46] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to Logstash for Zabe - https://phabricator.wikimedia.org/T297323 (Dzahn) @Zabe Alright, done! You have been added. You should have Logstash access now.
(and other things that come with the nda group)
[12:34:59] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to Logstash for Zabe - https://phabricator.wikimedia.org/T297323 (Dzahn) In progress→Resolved
[12:35:00] (PS7) Jbond: sre:SREBatchBase: Wrap everything in icinga downtime [cookbooks] - https://gerrit.wikimedia.org/r/749527
[12:35:05] (CR) Jbond: "thanks" [cookbooks] - https://gerrit.wikimedia.org/r/749527 (owner: Jbond)
[12:35:12] (PS13) Jbond: Add MySQL upgrade cookbook [cookbooks] - https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814)
[12:37:31] (CR) jerkins-bot: [V: -1] sre:SREBatchBase: Wrap everything in icinga downtime [cookbooks] - https://gerrit.wikimedia.org/r/749527 (owner: Jbond)
[12:37:45] (CR) jerkins-bot: [V: -1] Add MySQL upgrade cookbook [cookbooks] - https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814) (owner: Jbond)
[12:38:09] (CR) Jbond: [C: -1] "-1 to per jynus comments" [cookbooks] - https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814) (owner: Jbond)
[12:38:33] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:39:01] (PS8) Jbond: sre:SREBatchBase: Wrap everything in icinga downtime [cookbooks] - https://gerrit.wikimedia.org/r/749527
[12:54:18] (CR) Jcrespo: "Hey, following Manuel's advice that the buffer pool shouldn't be loaded on upgrade-" [cookbooks] - https://gerrit.wikimedia.org/r/749176 (https://phabricator.wikimedia.org/T239814) (owner: Ladsgroup)
[13:19:42] (CR) Volans: [C: +1] "LGTM besides the issue with inline comments. Feel free to fix it in a separate patch."
[software/wmfdb] - https://gerrit.wikimedia.org/r/749213 (owner: Kormat)
[13:22:31] (PS3) Jbond: WIP: add reposync [software/spicerack] - https://gerrit.wikimedia.org/r/747116
[13:23:47] (CR) Jbond: WIP: add reposync (12 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/747116 (owner: Jbond)
[13:27:04] (CR) Volans: "reply inline" [cookbooks] - https://gerrit.wikimedia.org/r/749527 (owner: Jbond)
[13:28:39] (CR) jerkins-bot: [V: -1] WIP: add reposync [software/spicerack] - https://gerrit.wikimedia.org/r/747116 (owner: Jbond)
[13:29:57] (CR) Jbond: sre:SREBatchBase: Wrap everything in icinga downtime (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/749527 (owner: Jbond)
[14:07:51] PROBLEM - SSH on restbase2010.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:25:10] SRE: Research improvements to Pwstore process - https://phabricator.wikimedia.org/T298194 (Majavah)
[14:37:21] (CR) Volans: [C: +1] "If the compiler is happy LGTM, minor optional comment inline."
[puppet] - https://gerrit.wikimedia.org/r/747091 (owner: Jbond)
[14:38:55] (CR) Volans: [C: +1] "LGTM, see inline for minor comment" [puppet] - https://gerrit.wikimedia.org/r/747855 (https://phabricator.wikimedia.org/T229397) (owner: Jbond)
[14:39:50] (CR) Volans: [C: +1] "replying to myself" [puppet] - https://gerrit.wikimedia.org/r/747091 (owner: Jbond)
[14:40:51] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:43:13] (CR) Jbond: reposync: add initial repo sync class (1 comment) [puppet] - https://gerrit.wikimedia.org/r/747091 (owner: Jbond)
[14:43:14] SRE: Research improvements to Pwstore process - https://phabricator.wikimedia.org/T298194 (bking) a:bking→None
[14:45:39] (CR) Jbond: reposync: add initial repo sync class (1 comment) [puppet] - https://gerrit.wikimedia.org/r/747091 (owner: Jbond)
[14:47:37] (PS5) Kormat: wmfdb/mycnf: Add support for parsing my.cnf files. [software/wmfdb] - https://gerrit.wikimedia.org/r/749213
[14:51:10] (CR) Kormat: [C: +2] wmfdb/mycnf: Add support for parsing my.cnf files. (3 comments) [software/wmfdb] - https://gerrit.wikimedia.org/r/749213 (owner: Kormat)
[14:51:41] (PS1) Kormat: Use pathlib instead of str for file paths. [software/wmfdb] - https://gerrit.wikimedia.org/r/749542
[14:52:41] (Merged) jenkins-bot: wmfdb/mycnf: Add support for parsing my.cnf files.
[software/wmfdb] - https://gerrit.wikimedia.org/r/749213 (owner: Kormat)
[14:54:14] !log mforns@deploy1002 Started deploy [analytics/refinery@fcf104e]: Adhoc train for anomaly detection queries [analytics/refinery@fcf104e]
[14:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:29] (CR) Giuseppe Lavagetto: [C: +2] mwdebug: improve helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/749148 (owner: Giuseppe Lavagetto)
[14:54:42] (PS2) Jelto: ssh-config: add config for gitlab.wikimedia.org [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/749509
[14:55:50] (CR) Jelto: ssh-config: add config for gitlab.wikimedia.org (2 comments) [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/749509 (owner: Jelto)
[14:57:47] (Merged) jenkins-bot: mwdebug: improve helmfile [deployment-charts] - https://gerrit.wikimedia.org/r/749148 (owner: Giuseppe Lavagetto)
[14:59:21] SRE, serviceops, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto)
[15:01:55] (CR) Volans: [C: +1] "LGTM, possible nit improvement inline" [software/wmfdb] - https://gerrit.wikimedia.org/r/749542 (owner: Kormat)
[15:03:56] (CR) JMeybohm: "You'll also need to bump the debian/changelog (`dch -i`) to actually build a new version later" [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/749509 (owner: Jelto)
[15:05:21] (PS2) Kormat: Use pathlib instead of str for file paths. [software/wmfdb] - https://gerrit.wikimedia.org/r/749542
[15:05:48] (CR) Kormat: Use pathlib instead of str for file paths.
(1 comment) [software/wmfdb] - https://gerrit.wikimedia.org/r/749542 (owner: Kormat)
[15:10:23] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[15:11:55] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[15:14:13] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[15:14:53] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[15:15:40] !log mforns@deploy1002 Finished deploy [analytics/refinery@fcf104e]: Adhoc train for anomaly detection queries [analytics/refinery@fcf104e] (duration: 21m 25s)
[15:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:57] PROBLEM - SSH on db2086.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:17:31] !log mforns@deploy1002 Started deploy [analytics/refinery@fcf104e] (thin): Adhoc train for anomaly detection queries THIN [analytics/refinery@fcf104e]
[15:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:39] !log mforns@deploy1002 Finished deploy [analytics/refinery@fcf104e] (thin): Adhoc train for anomaly detection queries THIN [analytics/refinery@fcf104e] (duration: 00m 07s)
[15:17:42] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:54] !log mforns@deploy1002 Started deploy [analytics/refinery@fcf104e] (hadoop-test): Adhoc train for anomaly detection queries TEST [analytics/refinery@fcf104e]
[15:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:01] (PS3) Jelto: ssh-config: add config for gitlab.wikimedia.org [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/749509
[15:21:05] (CR) Jelto: ssh-config: add config for gitlab.wikimedia.org (1 comment) [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/749509 (owner: Jelto)
[15:24:38] !log mforns@deploy1002 Finished deploy [analytics/refinery@fcf104e] (hadoop-test): Adhoc train for anomaly detection queries TEST [analytics/refinery@fcf104e] (duration: 06m 44s)
[15:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:27:55] (PS3) Giuseppe Lavagetto: docker::engine: raise a warning if overlayfs is not enabled [puppet] - https://gerrit.wikimedia.org/r/749506
[15:27:57] (PS2) Giuseppe Lavagetto: profile::base: introduce class memory_cgroup [puppet] - https://gerrit.wikimedia.org/r/749523
[15:27:59] (PS5) Giuseppe Lavagetto: deployment_server: add docker engine [puppet] - https://gerrit.wikimedia.org/r/749508 (https://phabricator.wikimedia.org/T297673)
[15:29:10] (CR) JMeybohm: [C: +1] docker::engine: raise a warning if overlayfs is not enabled [puppet] - https://gerrit.wikimedia.org/r/749506 (owner: Giuseppe Lavagetto)
[15:29:56] (CR) Giuseppe Lavagetto: [C: +2] docker::engine: raise a warning if overlayfs is not enabled [puppet] - https://gerrit.wikimedia.org/r/749506 (owner: Giuseppe Lavagetto)
[15:31:18] (PS1) JHathaway: Revert "mirrors: revert to sodium mirror temporarily" [dns] - https://gerrit.wikimedia.org/r/749547
[15:32:52] (CR) JHathaway: [C: +2] Revert "mirrors: revert to sodium mirror temporarily" [dns] -
10https://gerrit.wikimedia.org/r/749547 (owner: 10JHathaway) [15:37:24] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@f1522be]: (no justification provided) [15:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:30] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@f1522be]: (no justification provided) (duration: 00m 06s) [15:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:29] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (10DAbad) Approved [15:51:33] (03CR) 10JMeybohm: [C: 03+1] ssh-config: add config for gitlab.wikimedia.org [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/749509 (owner: 10Jelto) [16:10:07] RECOVERY - SSH on restbase2010.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:11:32] (03CR) 10Kormat: [C: 03+2] Use pathlib instead of str for file paths. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/749542 (owner: 10Kormat) [16:13:02] (03Merged) 10jenkins-bot: Use pathlib instead of str for file paths. 
[software/wmfdb] - 10https://gerrit.wikimedia.org/r/749542 (owner: 10Kormat) [16:16:57] RECOVERY - SSH on db2086.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:19:47] (03PS1) 10Giuseppe Lavagetto: profile:k8s::deployment_server::mediawiki: split in subprofiles [puppet] - 10https://gerrit.wikimedia.org/r/749552 (https://phabricator.wikimedia.org/T297673) [16:19:49] (03PS1) 10Giuseppe Lavagetto: kubernetes::deployment_server::mediawiki: add builder user/role [puppet] - 10https://gerrit.wikimedia.org/r/749553 (https://phabricator.wikimedia.org/T297673) [16:21:36] (03CR) 10jerkins-bot: [V: 04-1] kubernetes::deployment_server::mediawiki: add builder user/role [puppet] - 10https://gerrit.wikimedia.org/r/749553 (https://phabricator.wikimedia.org/T297673) (owner: 10Giuseppe Lavagetto) [16:23:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [16:32:26] (03PS1) 10Arturo Borrero Gonzalez: sonofgridengine: grid_configurator: reduce one indentation level [puppet] - 10https://gerrit.wikimedia.org/r/749557 [16:32:28] (03PS1) 10Arturo Borrero Gonzalez: sonofgridengine: grid_configurator: wrap subproccess calls [puppet] - 10https://gerrit.wikimedia.org/r/749558 [16:32:30] (03PS1) 10Arturo Borrero Gonzalez: sonofgridengine: grid_configurator: format with black -l120 [puppet] - 10https://gerrit.wikimedia.org/r/749559 [16:33:21] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: grid_configurator: wrap subproccess calls [puppet] - 10https://gerrit.wikimedia.org/r/749558 (owner: 10Arturo Borrero Gonzalez) [16:33:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] sonofgridengine: grid_configurator: reduce one indentation level [puppet] - 10https://gerrit.wikimedia.org/r/749557 (owner: 10Arturo Borrero Gonzalez) [16:33:42] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: grid_configurator: 
format with black -l120 [puppet] - 10https://gerrit.wikimedia.org/r/749559 (owner: 10Arturo Borrero Gonzalez) [16:38:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [16:43:37] (03PS2) 10Arturo Borrero Gonzalez: sonofgridengine: grid_configurator: wrap subproccess calls [puppet] - 10https://gerrit.wikimedia.org/r/749558 [16:43:39] (03PS2) 10Arturo Borrero Gonzalez: sonofgridengine: grid_configurator: format with black -l120 [puppet] - 10https://gerrit.wikimedia.org/r/749559 [16:44:32] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: grid_configurator: wrap subproccess calls [puppet] - 10https://gerrit.wikimedia.org/r/749558 (owner: 10Arturo Borrero Gonzalez) [16:44:34] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: grid_configurator: format with black -l120 [puppet] - 10https://gerrit.wikimedia.org/r/749559 (owner: 10Arturo Borrero Gonzalez) [16:45:14] 10SRE, 10ops-codfw, 10cloud-services-team (Kanban): Rack new cloud-dev servers in same rack - https://phabricator.wikimedia.org/T267662 (10Papaul) 05Open→03Resolved a:03Papaul [16:47:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [16:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:45] (03PS3) 10Arturo Borrero Gonzalez: sonofgridengine: grid_configurator: wrap subproccess calls [puppet] - 10https://gerrit.wikimedia.org/r/749558 [16:48:47] (03PS3) 10Arturo Borrero Gonzalez: sonofgridengine: grid_configurator: format with black -l120 [puppet] - 10https://gerrit.wikimedia.org/r/749559 [16:49:32] (03PS4) 10Arturo Borrero Gonzalez: sonofgridengine: grid_configurator: format with black -l100 [puppet] - 10https://gerrit.wikimedia.org/r/749559 [16:49:41] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: grid_configurator: format with black -l100 [puppet] - 10https://gerrit.wikimedia.org/r/749559 (owner: 
10Arturo Borrero Gonzalez) [16:49:57] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] sonofgridengine: grid_configurator: wrap subproccess calls [puppet] - 10https://gerrit.wikimedia.org/r/749558 (owner: 10Arturo Borrero Gonzalez) [16:50:37] (03PS1) 10Jcrespo: mediabackups: Add minio port to ipv6 connections [puppet] - 10https://gerrit.wikimedia.org/r/749561 (https://phabricator.wikimedia.org/T262668) [16:50:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [16:50:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [16:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:19] (03CR) 10jerkins-bot: [V: 04-1] mediabackups: Add minio port to ipv6 connections [puppet] - 10https://gerrit.wikimedia.org/r/749561 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [16:51:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] sonofgridengine: grid_configurator: format with black -l100 [puppet] - 10https://gerrit.wikimedia.org/r/749559 (owner: 10Arturo Borrero Gonzalez) [16:53:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [16:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:50] (03PS2) 10Jcrespo: mediabackups: Add minio port to ipv6 connections [puppet] - 10https://gerrit.wikimedia.org/r/749561 (https://phabricator.wikimedia.org/T262668) [16:58:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [16:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [17:01:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on 
pinkunicorn [17:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:55] PROBLEM - Check systemd state on backup1005 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:02:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [17:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:05] RECOVERY - Check systemd state on backup1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:23] <_joe_> dancy: did you trigger a rebuild of the mw images? [17:04:39] <_joe_> I'm trying to figure out why a release happened now [17:04:42] Yes, while testing T298165 [17:04:42] T298165: mediawiki-multiversion image builder should also poll private and security patches git repositories - https://phabricator.wikimedia.org/T298165 [17:04:54] <_joe_> oh sweet [17:05:01] 50% complete [17:05:13] <_joe_> do you want me to lock the deployments while you run your tests? [17:05:54] No, it's ok. I'm done w/ testing for the moment. Looking into how to deal w/ private settings now. 
[17:06:58] (03PS3) 10Jcrespo: mediabackups: Add minio port to ipv6 connections [puppet] - 10https://gerrit.wikimedia.org/r/749561 (https://phabricator.wikimedia.org/T262668) [17:07:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [17:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [17:08:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [17:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:33] PROBLEM - SSH on kubernetes1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:09:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [17:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:26] (03PS5) 10JMeybohm: Add support for returning bundles instead of certs from sign calls [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/748143 (https://phabricator.wikimedia.org/T294560) [17:16:16] (03CR) 10Ladsgroup: "Wow I slept and woke up and this CR ballooned to way bigger than what I expected. 
It is mostly my fault that I needed to communicate why I" [cookbooks] - 10https://gerrit.wikimedia.org/r/749176 (https://phabricator.wikimedia.org/T239814) (owner: 10Ladsgroup) [17:23:46] (03CR) 10Dzahn: [C: 03+2] "has approval" [puppet] - 10https://gerrit.wikimedia.org/r/749270 (https://phabricator.wikimedia.org/T298124) (owner: 10Dzahn) [17:23:52] (03PS2) 10Dzahn: admin: add Luke Bowmaker to analytics-platform-eng-admins [puppet] - 10https://gerrit.wikimedia.org/r/749270 (https://phabricator.wikimedia.org/T298124) [17:25:15] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (10Dzahn) [17:28:54] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (10Dzahn) ` info: Applying configuration version '(f65d06ffe5) Dzahn - admin: add Luke Bowmaker to analytics-platform-eng-admins' Notice: /Stage[main]... [17:30:04] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (10Dzahn) @DAbad Thank you. Done! Is it ok if we also add you as the approver for the 2 groups "analytics-platform-eng-admins" and "platform engineer... 
[17:35:38] !log removing one file for legal compliance [17:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:12] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host elastic1085.mgmt.eqiad.wmnet with reboot policy FORCED [17:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:17] (03PS1) 10Ahmon Dancy: rsync::quickdatacopy: Allow dest_path to be supplied [puppet] - 10https://gerrit.wikimedia.org/r/749563 [17:42:44] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1085.mgmt.eqiad.wmnet with reboot policy FORCED [17:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:31] 10SRE, 10ops-eqiad: Rack msw2-eqiad in cab A8 for configuration - https://phabricator.wikimedia.org/T296271 (10Cmjohnson) JUNOS version updated JUNOS 20.4R3-S1.3 built 2021-11-20 10:32:03 UTC [17:45:39] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host elastic1085.mgmt.eqiad.wmnet with reboot policy FORCED [17:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:03] (03PS2) 10Ahmon Dancy: rsync::quickdatacopy: Allow dest_path to be supplied [puppet] - 10https://gerrit.wikimedia.org/r/749563 [17:49:01] (03CR) 10Ahmon Dancy: [C: 03+1] "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1002/33081/releases1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/749563 (owner: 10Ahmon Dancy) [17:50:09] PROBLEM - Disk space on an-launcher1002 is CRITICAL: DISK CRITICAL - free space: / 2505 MB (3% inode=74%): /tmp 2505 MB (3% inode=74%): /var/tmp 2505 MB (3% inode=74%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops [17:51:16] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
elastic1085.mgmt.eqiad.wmnet with reboot policy FORCED [17:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:18] (03PS3) 10Ahmon Dancy: rsync::quickdatacopy: Allow dest_path to be supplied [puppet] - 10https://gerrit.wikimedia.org/r/749563 [17:58:20] (03PS1) 10Ahmon Dancy: profile::releases::mediawiki::private: Enable timer and alter target directory [puppet] - 10https://gerrit.wikimedia.org/r/749566 (https://phabricator.wikimedia.org/T298165) [18:04:51] 10SRE, 10ops-eqiad: Rack msw2-eqiad in cab A8 for configuration - https://phabricator.wikimedia.org/T296271 (10Cmjohnson) @ayounsi msw2 connected to scs-a8 port 44 msw1 connected to scs-a8 port 40 added the QSFP+ modules to both servers and connected with a 2M LC-LC fiber. There is a plain-text password f... [18:09:41] RECOVERY - SSH on kubernetes1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:11:02] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host elastic1088.mgmt.eqiad.wmnet with reboot policy FORCED [18:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:17] (03CR) 10Dzahn: [C: 03+1] "lgtm (but have to compiled yet)" [puppet] - 10https://gerrit.wikimedia.org/r/749563 (owner: 10Ahmon Dancy) [18:14:34] (03CR) 10Dzahn: "This seems to mean primary and secondary deployment servers will be different. 
Can't we just enable auto syncing and keep the path the sam" [puppet] - 10https://gerrit.wikimedia.org/r/749566 (https://phabricator.wikimedia.org/T298165) (owner: 10Ahmon Dancy) [18:16:37] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1088.mgmt.eqiad.wmnet with reboot policy FORCED [18:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:23] (03CR) 10Ahmon Dancy: [C: 04-1] "needs revision" [puppet] - 10https://gerrit.wikimedia.org/r/749566 (https://phabricator.wikimedia.org/T298165) (owner: 10Ahmon Dancy) [18:19:47] (03PS2) 10Ahmon Dancy: profile::releases::mediawiki::private: Enable timer and alter target directory [puppet] - 10https://gerrit.wikimedia.org/r/749566 (https://phabricator.wikimedia.org/T298165) [18:24:00] (03CR) 10Ahmon Dancy: [C: 03+1] "This change should only affect releases1002.eqiad.wmnet and releases2002.codfw.wmnet, not the deployment servers. The purpose of the affe" [puppet] - 10https://gerrit.wikimedia.org/r/749566 (https://phabricator.wikimedia.org/T298165) (owner: 10Ahmon Dancy) [18:24:18] (03CR) 10Ahmon Dancy: [C: 03+1] "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1002/33083/releases1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/749566 (https://phabricator.wikimedia.org/T298165) (owner: 10Ahmon Dancy) [18:27:22] (03CR) 104nn1l2: [C: 04-1] "You also need to designate sysops as those who can remove such userrights at the wgRemoveGroups section. 
Now admins can only grant such ri" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749244 (owner: 10MdsShakil) [18:42:42] !log T297735 removing/banning elastic1039 and elastic1043 from all EQIAD prod clusters [18:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:48] T297735: elastic1043.eqiad.wmnet stuck in power off state - https://phabricator.wikimedia.org/T297735 [18:50:52] 10ops-drmrs, 10DC-Ops, 10Infrastructure-Foundations, 10netops: drmrs: tata transit 1243318 xconnect hotcut - https://phabricator.wikimedia.org/T298208 (10RobH) p:05Triage→03Medium [18:51:35] 10SRE, 10docker-pkg, 10serviceops, 10Patch-For-Review, 10Release Pipeline (Blubber): Container image lifecycle management - https://phabricator.wikimedia.org/T287130 (10Dzahn) >>! In T287130#7585276, @RLazarus wrote: > - Regularly rsync the database from the active host to the passive one Hello @RLazaru... [18:52:53] (03CR) 10MdsShakil: T298187 (test) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749244 (owner: 10MdsShakil) [18:57:41] (03CR) 10Jcrespo: "When I add the generated rules to ferm manually, the whole doesn't open for the ipv6 version, what am I doing wrong?" [puppet] - 10https://gerrit.wikimedia.org/r/749561 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [18:59:24] (03CR) 10Dzahn: gitlab_runner: use config template for registering new runners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/747539 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [18:59:37] (03CR) 104nn1l2: [C: 04-1] "The code is good." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/749244 (owner: 10MdsShakil) [18:59:43] (03CR) 10Majavah: "With recent (I believe buster+) versions of ferm, you should be able to leave out the record type entirely (@resolve((foo-host.example bar" [puppet] - 10https://gerrit.wikimedia.org/r/749561 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [19:00:13] (03CR) 10Dzahn: [C: 03+1] "certainly needs testing and not a quick merge, but it looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/747539 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [19:05:11] (03CR) 10Dzahn: "The only thing I know that would need to talk from prod to cloud would be CI/jenkins -> CI/gitlab. Besides that I am not aware of other pr" [puppet] - 10https://gerrit.wikimedia.org/r/677872 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [19:05:51] (03PS4) 10MdsShakil: Create autopatroller and patroller groups on bnwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749244 (https://phabricator.wikimedia.org/T298187) [19:06:23] 10SRE: Research improvements to Pwstore process - https://phabricator.wikimedia.org/T298194 (10Legoktm) https://github.com/jasonaowen/gpg-expiration/blob/main/gpg-expiration.py looks like it could be copied for the key expiry checking. I'm not a huge fan of switching to non-expiring keys just because it's nice... 
[19:06:35] (03CR) 10Dzahn: "@Brennen @Jelto adding you because it seems it would affect the gitlab-runner situation (and current jenkins-slaves)" [puppet] - 10https://gerrit.wikimedia.org/r/677872 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [19:06:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_netflow_hourly.service,eventlogging_to_druid_prefupdate_hourly.service,prometheus_puppet_agent_stats.service,refine_event_sanitized_analytics_immediate.service,refine_event_sanitized_main_immediate.serv [19:06:41] rtupdater-browser.service,reportupdater-reference-previews.service,reportupdater-templatedata.service,reportupdater-templatewizard.service,reportupdater-visualeditor.service,reportupdater-wmcs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:10:29] (03CR) 10Yahya: [C: 03+1] Create autopatroller and patroller groups on bnwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749244 (https://phabricator.wikimedia.org/T298187) (owner: 10MdsShakil) [19:11:54] (03CR) 10Majavah: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749244 (https://phabricator.wikimedia.org/T298187) (owner: 10MdsShakil) [19:12:10] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (10Dzahn) 05In progress→03Resolved [19:14:00] 10SRE, 10Maps: Allow Wikimedia Maps usage on wikijournal.org - https://phabricator.wikimedia.org/T297948 (10Dzahn) p:05Triage→03Medium [19:14:37] 10SRE, 10Maps: Allow Wikimedia Maps usage on bbcrewind.co.uk - https://phabricator.wikimedia.org/T297968 (10Dzahn) p:05Triage→03High [19:14:56] 10SRE, 10Maps: Allow Wikimedia Maps usage on bbcrewind.co.uk - https://phabricator.wikimedia.org/T297968 (10Dzahn) p:05High→03Medium 
[19:15:03] RECOVERY - Disk space on an-launcher1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops [19:15:38] 10SRE: Research improvements to Pwstore process - https://phabricator.wikimedia.org/T298194 (10Dzahn) p:05Triage→03Medium [19:16:55] 10SRE: Research improvements to Pwstore process - https://phabricator.wikimedia.org/T298194 (10Dzahn) I chatted with Moritz and he is suggesting the "keys that never expire" route but also has plans to replace pwstore entirely. Recommend to chat with him before starting code to monitor expiry in pwstore. [19:16:59] (03PS3) 10Ladsgroup: Add MySQL upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/749176 (https://phabricator.wikimedia.org/T239814) [19:17:20] (03CR) 10Ladsgroup: Add MySQL upgrade cookbook (038 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/749176 (https://phabricator.wikimedia.org/T239814) (owner: 10Ladsgroup) [19:19:39] (03CR) 10jerkins-bot: [V: 04-1] Add MySQL upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/749176 (https://phabricator.wikimedia.org/T239814) (owner: 10Ladsgroup) [19:19:55] (03CR) 10Jcrespo: ":-(" [puppet] - 10https://gerrit.wikimedia.org/r/749561 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [19:21:32] (03PS4) 10Ladsgroup: Add MySQL upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/749176 (https://phabricator.wikimedia.org/T239814) [19:24:27] (03CR) 10jerkins-bot: [V: 04-1] Add MySQL upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/749176 (https://phabricator.wikimedia.org/T239814) (owner: 10Ladsgroup) [19:37:26] 10SRE, 10MediaWiki-General, 10Performance-Team, 10serviceops-radar, and 2 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10Krinkle) [19:40:41] RECOVERY - Check systemd state on 
an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:41:44] (03CR) 10Dzahn: "thinking about it maybe also DBA because somehow the sanitized DB dumps must get from prod into cloud but I don't know how that works" [puppet] - 10https://gerrit.wikimedia.org/r/677872 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [19:53:13] (03PS5) 10Ladsgroup: Add MySQL upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/749176 (https://phabricator.wikimedia.org/T239814) [19:59:04] (03CR) 10Ahmon Dancy: [C: 03+1] profile::releases::mediawiki::private: Enable timer and alter target directory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/749566 (https://phabricator.wikimedia.org/T298165) (owner: 10Ahmon Dancy) [20:26:34] 10SRE, 10Parsoid-Tests, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (10Dzahn) So, I have been debugging this again and summary is: the chain here is (traffic layer) -> envoy (443) -> ngin... [20:33:23] 10SRE, 10MediaWiki-General, 10Performance-Team, 10serviceops-radar, and 2 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10Krinkle) Next steps: * Decide on which database name(s) we need on the x2 cluster. * Create them. * Try connectin... [20:40:25] 10SRE, 10Parsoid-Tests, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (10Dzahn) >>! In T266509#7195286, @ssastry wrote: > looks like something is intercepting requests to parsoid-rt-tests.wiki... 
[20:42:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10Volans) [20:43:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10Volans) @Cmjohnson @Jclark-ctr All done from my side. BIOS/iDRAC should be all set up. It was done via the new cookbook. If you can... [20:44:05] 10SRE, 10Parsoid-Tests, 10Traffic, 10serviceops, and 2 others: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (10Dzahn) [20:47:24] 10SRE, 10Parsoid-Tests, 10Traffic, 10serviceops, and 2 others: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (10Dzahn) Hey traffic, I added you to this ticket because I think a line in varnish config above, the one that handles URLs with "static" in... 
[20:51:21] (03CR) 104nn1l2: [C: 03+1] "LGTM now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749244 (https://phabricator.wikimedia.org/T298187) (owner: 10MdsShakil) [20:53:36] (03PS1) 10Dzahn: add parsoid-rt-tests.wikimedia.org to alternate_domains [puppet] - 10https://gerrit.wikimedia.org/r/749574 (https://phabricator.wikimedia.org/T266509) [20:55:12] (03PS1) 10Volans: CHANGELOG: add changelogs for release v1.1.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/749575 [20:55:26] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v1.1.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/749575 (owner: 10Volans) [20:55:58] (03CR) 10Dzahn: "@subbu @arlolra Aside from the issue with the ./static/ files, assuming it fixes this, would you prefer "normal" caching by varnish for pa" [puppet] - 10https://gerrit.wikimedia.org/r/749574 (https://phabricator.wikimedia.org/T266509) (owner: 10Dzahn) [20:57:43] (03CR) 10Dzahn: "@traffic the goal here is that URLs containing "./static/" are NOT treated in a special way for this site. 
to make this work: https://pars" [puppet] - 10https://gerrit.wikimedia.org/r/749574 (https://phabricator.wikimedia.org/T266509) (owner: 10Dzahn) [21:01:35] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v1.1.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/749575 (owner: 10Volans) [21:02:39] (03CR) 10Dzahn: "this 404 does not come from the backend: https://parsoid-rt-tests.wikimedia.org/static/style.css" [puppet] - 10https://gerrit.wikimedia.org/r/749574 (https://phabricator.wikimedia.org/T266509) (owner: 10Dzahn) [21:04:22] (03PS1) 10Volans: Upstream release v1.1.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/749576 [21:06:39] (03CR) 10Volans: [C: 03+2] Upstream release v1.1.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/749576 (owner: 10Volans) [21:12:32] (03CR) 10Subramanya Sastry: [C: 03+1] add parsoid-rt-tests.wikimedia.org to alternate_domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/749574 (https://phabricator.wikimedia.org/T266509) (owner: 10Dzahn) [21:12:50] (03Merged) 10jenkins-bot: Upstream release v1.1.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/749576 (owner: 10Volans) [21:27:39] !log uploaded spicerack_1.1.1 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia [21:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:03] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:09:10] 10SRE, 10ops-eqiad: Rack msw2-eqiad in cab A8 for configuration - https://phabricator.wikimedia.org/T296271 (10Cmjohnson) a:05Cmjohnson→03ayounsi [22:30:33] (03PS1) 10Volans: quotereviewer: add compatibility with new format [software] - 10https://gerrit.wikimedia.org/r/749588 [22:33:40] (03CR) 10RobH: [C: 03+1] "works for me on the new quotes!" 
[software] - 10https://gerrit.wikimedia.org/r/749588 (owner: 10Volans) [22:34:26] (03CR) 10Volans: [C: 03+2] quotereviewer: add compatibility with new format [software] - 10https://gerrit.wikimedia.org/r/749588 (owner: 10Volans) [22:34:58] (03Merged) 10jenkins-bot: quotereviewer: add compatibility with new format [software] - 10https://gerrit.wikimedia.org/r/749588 (owner: 10Volans) [23:01:31] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (3 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - bking@cumin1001 - T297986 [23:01:31] (03PS1) 10Volans: sre.hosts.provision: fix boot order [cookbooks] - 10https://gerrit.wikimedia.org/r/749590 [23:01:33] (03PS1) 10Volans: sre.hosts.provision: check if changes were applied [cookbooks] - 10https://gerrit.wikimedia.org/r/749591 [23:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:37] T297986: [Tracking task] Pair with brian king on various operational tasks - https://phabricator.wikimedia.org/T297986 [23:14:28] (03CR) 10David Caro: sonofgridengine: grid_configurator: wrap subproccess calls (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/749558 (owner: 10Arturo Borrero Gonzalez) [23:28:31] PROBLEM - MariaDB Replica IO: x1 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2013, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Lost connection to MySQL server at reading authorization packet, system error: 104 Connection reset by peer https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:33:03] RECOVERY - MariaDB Replica IO: x1 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica