[00:00:05] Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (Dzahn) [00:00:05] brennen: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211210T0000). [00:00:05] seddon, cjming, and James_F: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:14] o/ [00:00:24] o/ [00:00:53] o/ [00:03:05] I can deploy if brennen is busy? [00:03:33] Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (Dzahn) [00:03:52] James_F: we're all in our deployers meeting, we'll get it going Soon :) [00:04:03] Sure sure, I'll wait. :-) [00:05:19] (PS1) Clare Ming: Search_result_page_id should be integer [extensions/MediaSearch] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/745646 (https://phabricator.wikimedia.org/T297400) [00:05:44] (CR) Clare Ming: [C: +2] Search_result_page_id should be integer [extensions/MediaSearch] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/745646 (https://phabricator.wikimedia.org/T297400) (owner: Clare Ming) [00:06:06] o/ (late) [00:06:47] (CR) Clare Ming: [C: +2] Update WebABTestEnrollment name [mediawiki-config] - https://gerrit.wikimedia.org/r/745598 (https://phabricator.wikimedia.org/T295972) (owner: Clare Ming) [00:07:15] (CR) Clare Ming: [C: +2] Update A/B test enrollment name [skins/Vector] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/745607 (https://phabricator.wikimedia.org/T292587) (owner: Clare Ming) [00:07:27] (Merged) jenkins-bot: Update WebABTestEnrollment name [mediawiki-config] - 
https://gerrit.wikimedia.org/r/745598 (https://phabricator.wikimedia.org/T295972) (owner: Clare Ming) [00:09:51] (PS1) Dzahn: delete cescrout role and profile [puppet] - https://gerrit.wikimedia.org/r/745627 (https://phabricator.wikimedia.org/T272559) [00:09:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:55] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:745598|Update WebABTestEnrollment name (T295972)]] (duration: 00m 57s) [00:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:00] T295972: Deploy sticky header to office wiki and test wiki - https://phabricator.wikimedia.org/T295972 [00:18:07] (PS3) Clare Ming: Revert "VE on zh.wiki: Enable single-edit-tab mode" [mediawiki-config] - https://gerrit.wikimedia.org/r/745385 (https://phabricator.wikimedia.org/T296269) (owner: Esanders) [00:19:34] (CR) Clare Ming: [C: +2] Revert "VE on zh.wiki: Enable single-edit-tab mode" [mediawiki-config] - https://gerrit.wikimedia.org/r/745385 (https://phabricator.wikimedia.org/T296269) (owner: Esanders) [00:20:22] (Merged) jenkins-bot: Revert "VE on zh.wiki: Enable single-edit-tab mode" [mediawiki-config] - https://gerrit.wikimedia.org/r/745385 (https://phabricator.wikimedia.org/T296269) (owner: Esanders) [00:21:28] (PS1) Dzahn: annualreport: switch git::clone latest to present [puppet] - https://gerrit.wikimedia.org/r/745628 (https://phabricator.wikimedia.org/T218900) [00:21:48] James_F: can you check your patch on mwdebug1002? [00:22:07] Looking. 
[00:22:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:22:34] cjming: Yeah, LGTM. [00:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:40] cool - syncing [00:23:39] Seddon: still waiting on your stuff to merge [00:23:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:58] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:745385|Revert "VE on zh.wiki: Enable single-edit-tab mode" (T296269)]] (duration: 00m 56s) [00:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:03] T296269: Enable VisualEditor for Chinese Wikipedia - https://phabricator.wikimedia.org/T296269 [00:24:06] James_F: live! [00:25:23] cjming: Brilliant, thank you. [00:25:29] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/32946/miscweb1002.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/745628 (https://phabricator.wikimedia.org/T218900) (owner: Dzahn) [00:27:03] (Merged) jenkins-bot: Search_result_page_id should be integer [extensions/MediaSearch] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/745646 (https://phabricator.wikimedia.org/T297400) (owner: Clare Ming) [00:27:05] (Merged) jenkins-bot: Update A/B test enrollment name [skins/Vector] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/745607 (https://phabricator.wikimedia.org/T292587) (owner: Clare Ming) [00:28:56] (CR) Dzahn: "This is a noop on miscweb1002/2002. 
It doesn't mean all these are that easy because they have a social component where there still actual " [puppet] - https://gerrit.wikimedia.org/r/745628 (https://phabricator.wikimedia.org/T218900) (owner: Dzahn) [00:29:27] (PS1) Ebernhardson: Integrate wcqs with wdqs cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/745629 (https://phabricator.wikimedia.org/T293638) [00:30:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:19] Seddon: can you check mwdebug1002? [00:32:05] (CR) jerkins-bot: [V: -1] Integrate wcqs with wdqs cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/745629 (https://phabricator.wikimedia.org/T293638) (owner: Ebernhardson) [00:32:51] cjming: looks good! [00:33:00] great - syncing [00:33:27] !log cjming@deploy1002 Synchronized php-1.38.0-wmf.12/skins/Vector: Backport: [[gerrit:745607|Update A/B test enrollment name (T292587)]] (duration: 00m 56s) [00:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:33:32] T292587: Sticky header: Create A/B test schema and tie to sticky header feature - https://phabricator.wikimedia.org/T292587 [00:34:35] !log cjming@deploy1002 Synchronized php-1.38.0-wmf.12/extensions/MediaSearch/resources/components/QuickView.vue: Backport: [[gerrit:745646|Search_result_page_id should be integer (T297400)]] (duration: 00m 55s) [00:34:38] Seddon: live! 
[00:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:34:40] T297400: '.search_result_page_id' should be integer - https://phabricator.wikimedia.org/T297400 [00:35:21] (PS1) Dzahn: bienvenida.wikimedia.org: switch git::clone from latest to present [puppet] - https://gerrit.wikimedia.org/r/745631 (https://phabricator.wikimedia.org/T218900) [00:36:08] (CR) Dzahn: "same here, this seems safe because https://gerrit.wikimedia.org/r/q/project:wikimedia/campaigns/eswiki-2018" [puppet] - https://gerrit.wikimedia.org/r/745631 (https://phabricator.wikimedia.org/T218900) (owner: Dzahn) [00:36:20] !log end of UTC late backport & config window [00:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:36:39] (CR) Dzahn: [C: +2] bienvenida.wikimedia.org: switch git::clone from latest to present [puppet] - https://gerrit.wikimedia.org/r/745631 (https://phabricator.wikimedia.org/T218900) (owner: Dzahn) [00:37:27] (CR) Ssingh: [C: +2] delete cescrout role and profile [puppet] - https://gerrit.wikimedia.org/r/745627 (https://phabricator.wikimedia.org/T272559) (owner: Dzahn) [00:37:30] (CR) RLazarus: [C: +1] "Makes sense! 
I guess we'll want some documentation somewhere on how to update this manually, maybe under https://wikitech.wikimedia.org/wi" [puppet] - https://gerrit.wikimedia.org/r/745628 (https://phabricator.wikimedia.org/T218900) (owner: Dzahn) [00:38:54] (PS2) Ebernhardson: sre.wdqs: Integrate wcqs with wdqs cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/745629 (https://phabricator.wikimedia.org/T293638) [00:40:52] (CR) Scardenasmolinar: [C: +1] [beta] Deploy GDI survey to cawiki and fawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) (owner: Eigyan) [00:49:47] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:01:55] (PS1) Dzahn: wdqs: switch GUI deployment from latest to present [puppet] - https://gerrit.wikimedia.org/r/745634 (https://phabricator.wikimedia.org/T218900) [01:03:23] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:07:17] (CR) Dzahn: "Know who is actually deploying changes here? Do you mind one extra git pull and needing shell?" 
[puppet] - https://gerrit.wikimedia.org/r/745634 (https://phabricator.wikimedia.org/T218900) (owner: Dzahn) [01:13:03] PROBLEM - WDQS high update lag on wdqs1012 is CRITICAL: 6.4e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [01:13:27] (PS1) Dzahn: transparency.wikimedia.org: switch git::clone from latest to present [puppet] - https://gerrit.wikimedia.org/r/745637 (https://phabricator.wikimedia.org/T218900) [01:18:34] (CR) Dzahn: [C: +2] transparency.wikimedia.org: switch git::clone from latest to present [puppet] - https://gerrit.wikimedia.org/r/745637 (https://phabricator.wikimedia.org/T218900) (owner: Dzahn) [01:21:28] Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (Dzahn) [01:25:09] SRE, WMF-Legal, serviceops, Patch-For-Review: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (Dzahn) https://gerrit.wikimedia.org/r/c/operations/puppet/+/745637 as part of T218900 means in the unlikely event that you need futu... 
[02:05:05] PROBLEM - Check systemd state on aqs1011 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:05:39] PROBLEM - aqs endpoints health on aqs1015 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Te [02:05:39] ggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views retur [02:05:39] unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [02:05:43] PROBLEM - aqs endpoints health on aqs1010 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Te [02:05:43] ggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) 
is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views retur [02:05:43] unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [02:05:45] PROBLEM - aqs endpoints health on aqs1013 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Te [02:05:45] ggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views retur [02:05:45] unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [02:05:49] PROBLEM - cassandra-a service on aqs1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:06:01] PROBLEM - aqs endpoints health on aqs1014 is CRITICAL: 
/analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Te [02:06:01] ggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views retur [02:06:01] unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [02:06:02] PROBLEM - aqs endpoints health on aqs1012 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Te [02:06:03] ggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views retur 
[02:06:03] unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [02:06:09] PROBLEM - aqs endpoints health on aqs1011 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Te [02:06:09] ggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views retur [02:06:09] unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) is CRITICAL: https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [02:06:25] RECOVERY - WDQS high update lag on wdqs1012 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 2.078e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [02:06:41] PROBLEM - cassandra-a CQL 10.64.16.204:9042 on aqs1011 is CRITICAL: connect to address 10.64.16.204 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [02:31:45] RECOVERY - Check systemd state on aqs1011 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:31:55] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:38:25] PROBLEM - Check systemd state on aqs1011 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:53:16] (PingOffloadMissingIP) firing: Target IP missing on ping2002:9100 loopback in codfw - https://wikitech.wikimedia.org/wiki/Ping_offload#InAddrErrors_alert - https://grafana.wikimedia.org/d/000000513/ping-offload - https://alerts.wikimedia.org [03:01:19] RECOVERY - cassandra-a service on aqs1011 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:05:43] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:07:59] PROBLEM - cassandra-a service on aqs1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:01:35] RECOVERY - cassandra-a service on aqs1011 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:08:19] PROBLEM - cassandra-a service on aqs1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:47:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:47:52] (PS1) Samwilson: Enable Disambiguator notifications on more wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/745670 
(https://phabricator.wikimedia.org/T297175) [05:50:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:54:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1102.eqiad.wmnet with reason: Maintenance [05:54:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1102.eqiad.wmnet with reason: Maintenance [05:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1139.eqiad.wmnet with reason: Maintenance [05:55:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1139.eqiad.wmnet with reason: Maintenance [05:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:00] SRE, DBA, observability, Sustainability (Incident Followup): Monitor/dashboard number of queries killed by the automatic query killer - https://phabricator.wikimedia.org/T293531 (Marostegui) That would work for me indeed [06:14:50] (PS1) Marostegui: pc2014: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/745671 (https://phabricator.wikimedia.org/T295965) [06:15:44] (CR) Marostegui: [C: +2] pc2014: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/745671 (https://phabricator.wikimedia.org/T295965) (owner: Marostegui) [06:19:20] (CR) Marostegui: [C: +1] auto_schema: Add logging on file [software] - https://gerrit.wikimedia.org/r/744850 (https://phabricator.wikimedia.org/T288235) (owner: Ladsgroup) [06:21:06] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 11 hosts with reason: Maintenance [06:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 11 hosts with reason: Maintenance [06:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:47] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:30:59] RECOVERY - BGP status on cr3-eqsin is OK: BGP OK - up: 332, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:53:16] (PingOffloadMissingIP) firing: Target IP missing on ping2002:9100 loopback in codfw - https://wikitech.wikimedia.org/wiki/Ping_offload#InAddrErrors_alert - https://grafana.wikimedia.org/d/000000513/ping-offload - https://alerts.wikimedia.org [07:32:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [07:32:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [07:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1171.eqiad.wmnet with reason: Maintenance [07:33:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1171.eqiad.wmnet with reason: Maintenance [07:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 
4:00:00 on db1170.eqiad.wmnet with reason: Maintenance [07:33:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1170.eqiad.wmnet with reason: Maintenance [07:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T277354)', diff saved to https://phabricator.wikimedia.org/P18083 and previous config saved to /var/cache/conftool/dbconfig/20211210-073342-marostegui.json [07:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:47] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [07:35:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T277354)', diff saved to https://phabricator.wikimedia.org/P18084 and previous config saved to /var/cache/conftool/dbconfig/20211210-073520-marostegui.json [07:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:52] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:50:10] (PS1) Muehlenhoff: Remove access for christinedk [puppet] - https://gerrit.wikimedia.org/r/745726 [07:50:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P18085 and previous config saved to /var/cache/conftool/dbconfig/20211210-075024-marostegui.json [07:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:03] (CR) Muehlenhoff: [C: +2] Remove access for christinedk [puppet] - https://gerrit.wikimedia.org/r/745726 (owner: Muehlenhoff) [08:00:04] Deploy window No deploys all day! 
See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211210T0800) [08:05:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P18086 and previous config saved to /var/cache/conftool/dbconfig/20211210-080529-marostegui.json [08:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:33] (PS3) Elukey: admin_ng: refactor istio helmfile config to allow egress gateways [deployment-charts] - https://gerrit.wikimedia.org/r/743438 (https://phabricator.wikimedia.org/T294414) [08:12:45] (CR) Elukey: admin_ng: refactor istio helmfile config to allow egress gateways (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/743438 (https://phabricator.wikimedia.org/T294414) (owner: Elukey) [08:13:11] !log drain primary/secondary instance off ganeti2008 T296622 [08:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:17] T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 [08:18:14] SRE, DBA, observability, Patch-For-Review, User-Ladsgroup: Send metrics of db errors of mediawiki to promethues - https://phabricator.wikimedia.org/T297435 (fgiunchedi) Yes once you have logs in elasticsearch you can turn search queries into Prometheus metrics, from there you have dashboards... 
[08:20:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T277354)', diff saved to https://phabricator.wikimedia.org/P18087 and previous config saved to /var/cache/conftool/dbconfig/20211210-082034-marostegui.json [08:20:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1101.eqiad.wmnet with reason: Maintenance [08:20:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1101.eqiad.wmnet with reason: Maintenance [08:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:40] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [08:20:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T277354)', diff saved to https://phabricator.wikimedia.org/P18088 and previous config saved to /var/cache/conftool/dbconfig/20211210-082041-marostegui.json [08:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:33] (CR) JMeybohm: [C: +1] admin_ng: refactor istio helmfile config to allow egress gateways [deployment-charts] - https://gerrit.wikimedia.org/r/743438 (https://phabricator.wikimedia.org/T294414) (owner: Elukey) [08:25:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T277354)', diff saved to https://phabricator.wikimedia.org/P18089 and previous config saved to /var/cache/conftool/dbconfig/20211210-082520-marostegui.json [08:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:38] RECOVERY - cassandra-a service on aqs1011 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:32:25] 
(PS8) Filippo Giunchedi: Adjust PingOffloadMissingIP alert so it recovers properly [alerts] - https://gerrit.wikimedia.org/r/745563 (https://phabricator.wikimedia.org/T297397) (owner: Cathal Mooney) [08:33:02] topranks: ^ I've adjusted the test time series to have more data, and the evaluation time of the test to be further (10m) for rate([5m]) to have enough data [08:38:16] PROBLEM - cassandra-a service on aqs1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:40:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P18090 and previous config saved to /var/cache/conftool/dbconfig/20211210-084024-marostegui.json [08:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:19] (CR) JMeybohm: [C: +2] admin_ng: Add helmfile for cert-manager and cfssl-issuer [deployment-charts] - https://gerrit.wikimedia.org/r/737939 (https://phabricator.wikimedia.org/T294560) (owner: JMeybohm) [08:53:24] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [08:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:31] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[08:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P18091 and previous config saved to /var/cache/conftool/dbconfig/20211210-085529-marostegui.json [08:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:17] (03CR) 10Kormat: [C: 03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/745489 (https://phabricator.wikimedia.org/T296373) (owner: 10Filippo Giunchedi) [09:10:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T277354)', diff saved to https://phabricator.wikimedia.org/P18092 and previous config saved to /var/cache/conftool/dbconfig/20211210-091034-marostegui.json [09:10:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1098.eqiad.wmnet with reason: Maintenance [09:10:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1098.eqiad.wmnet with reason: Maintenance [09:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:40] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [09:10:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T277354)', diff saved to https://phabricator.wikimedia.org/P18093 and previous config saved to /var/cache/conftool/dbconfig/20211210-091041-marostegui.json [09:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:33] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: default to email blackhole [puppet] - 10https://gerrit.wikimedia.org/r/745489 
(https://phabricator.wikimedia.org/T296373) (owner: 10Filippo Giunchedi) [09:13:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T277354)', diff saved to https://phabricator.wikimedia.org/P18094 and previous config saved to /var/cache/conftool/dbconfig/20211210-091319-marostegui.json [09:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:42] (03PS1) 10JMeybohm: Deploy cert-manager after namespace creation [deployment-charts] - 10https://gerrit.wikimedia.org/r/745757 (https://phabricator.wikimedia.org/T294560) [09:21:39] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Deploy cert-manager after namespace creation [deployment-charts] - 10https://gerrit.wikimedia.org/r/745757 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [09:28:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P18095 and previous config saved to /var/cache/conftool/dbconfig/20211210-092823-marostegui.json [09:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:29] ACKNOWLEDGEMENT - cassandra-a service on aqs1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed Btullis Working on this. 
T291470 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:37:07] (03PS1) 10JMeybohm: admin_ng: Explicitly set dependency to namespaces in cert-manager [deployment-charts] - 10https://gerrit.wikimedia.org/r/745760 (https://phabricator.wikimedia.org/T294560) [09:43:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P18096 and previous config saved to /var/cache/conftool/dbconfig/20211210-094328-marostegui.json [09:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:00] (03PS1) 10JMeybohm: cert-manager: Use numeric UID for nobody [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/745761 (https://phabricator.wikimedia.org/T294560) [09:48:18] (03PS1) 10JMeybohm: admin_ng: Don't let helm create namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/745762 [09:48:20] (03PS1) 10JMeybohm: admin_ng: Bump cert-manager image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/745763 (https://phabricator.wikimedia.org/T294560) [09:48:58] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] cert-manager: Use numeric UID for nobody [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/745761 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [09:54:23] (03CR) 10Jelto: [C: 03+1] "this LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/745760 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [09:58:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T277354)', diff saved to https://phabricator.wikimedia.org/P18097 and previous config saved to /var/cache/conftool/dbconfig/20211210-095833-marostegui.json [09:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:38] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 
[10:00:26] RECOVERY - cassandra-a service on aqs1011 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:00:48] RECOVERY - Check systemd state on aqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:00:51] !log repool cp5006 [10:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:44] RECOVERY - aqs endpoints health on aqs1012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:02:26] RECOVERY - aqs endpoints health on aqs1014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:02:46] RECOVERY - cassandra-a CQL 10.64.16.204:9042 on aqs1011 is OK: TCP OK - 0.000 second response time on 10.64.16.204 port 9042 https://phabricator.wikimedia.org/T93886 [10:02:48] RECOVERY - aqs endpoints health on aqs1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:03:02] RECOVERY - aqs endpoints health on aqs1010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:03:06] RECOVERY - aqs endpoints health on aqs1013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:03:20] RECOVERY - aqs endpoints health on aqs1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:03:55] !log published docker-registry.discovery.wmnet/cert-manager/cainjector:1.5.4-2 docker-registry.discovery.wmnet/cert-manager/webhook:1.5.4-2 docker-registry.discovery.wmnet/cert-manager/controller:1.5.4-2 [10:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:17] godog: only catching up now thanks for the help! 
[10:13:37] I still haven't really wrapped my head around how the test works, but if Jenkins is happy then that's great :)
[10:14:09] (CR) Cathal Mooney: [C: +1] "Thanks for the help Filippo!" [alerts] - https://gerrit.wikimedia.org/r/745563 (https://phabricator.wikimedia.org/T297397) (owner: Cathal Mooney)
[10:14:41] topranks: sure np! happy to help with what's unclear in the test too
[10:15:01] well really just what the "values" field should represent.
[10:15:26] (CR) Jbond: [C: +1] "lgtm couple of nits" [puppet] - https://gerrit.wikimedia.org/r/742110 (https://phabricator.wikimedia.org/T263277) (owner: Ayounsi)
[10:16:39] Like that I read the docs and can see how your expression will be expanded, but I don't know why that makes sense in terms of the particular query.
[10:17:58] RECOVERY - dump of s3 in eqiad on alert1001 is OK: Last dump for s3 at eqiad (db1145.eqiad.wmnet:3313) taken on 2021-12-10 08:12:38 (120 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[10:19:59] topranks: ah I see, my understanding is that 'values' represents the data points over time that should be tested
[10:20:12] in this case the inerrors
[10:20:41] is it the number of data points the query should return?
[10:21:09] or the actual values of the metric, i.e. actual number of inerrors in this case?
[10:21:55] (PS1) Arturo Borrero Gonzalez: ceph: auth: drop cinder-backup keyrings [puppet] - https://gerrit.wikimedia.org/r/745765
[10:22:11] the latter, the metric datapoints themselves not the query results
[10:23:06] SRE, SRE-Access-Requests: Add Lucas_WMDE to #mediawiki_security - https://phabricator.wikimedia.org/T297226 (Lucas_Werkmeister_WMDE) >>! In T297226#7555943, @Lucas_Werkmeister_WMDE wrote: > I can confirm that Lucas_WMDE on Libera Chat is my account, and as far as I’m aware it’s also appropriately passwor...
[10:27:25] (CR) Ayounsi: Pmacct add sflow listener (2 comments) [puppet] - https://gerrit.wikimedia.org/r/742110 (https://phabricator.wikimedia.org/T263277) (owner: Ayounsi)
[10:27:51] (PS7) Ayounsi: Pmacct add sflow listener [puppet] - https://gerrit.wikimedia.org/r/742110 (https://phabricator.wikimedia.org/T263277)
[10:29:18] hmmm... so in this case the result of that query over 10 minutes gives us 10 results with value 0?
[10:29:35] And the value expression you have expands out to "6 12 18 24 30 36 42" if I'm not wrong.
[10:32:23] godog: I don't really get the relationship. But not to worry I can revisit again if it's not a simple explanation.
[10:32:23] (PS2) Arturo Borrero Gonzalez: ceph: auth: drop cinder-backup keyrings [puppet] - https://gerrit.wikimedia.org/r/745765 (https://phabricator.wikimedia.org/T292546)
[10:32:25] (PS1) Arturo Borrero Gonzalez: ceph: auth: load_all: fix input datatype [puppet] - https://gerrit.wikimedia.org/r/745767 (https://phabricator.wikimedia.org/T293752)
[10:32:27] (PS1) Arturo Borrero Gonzalez: ceph: auth: load_all: don't fail if only keydata is defined [puppet] - https://gerrit.wikimedia.org/r/745768 (https://phabricator.wikimedia.org/T293752)
[10:33:35] (CR) Jbond: [C: +1] "thx lgtm" [puppet] - https://gerrit.wikimedia.org/r/742110 (https://phabricator.wikimedia.org/T263277) (owner: Ayounsi)
[10:34:44] (CR) Arturo Borrero Gonzalez: [V: +2 C: +2] "PCC NOOP https://puppet-compiler.wmflabs.org/compiler1001/32947/" [puppet] - https://gerrit.wikimedia.org/r/745767 (https://phabricator.wikimedia.org/T293752) (owner: Arturo Borrero Gonzalez)
[10:35:59] topranks: ok no problem, the way I think about it is that "6 12 18 24 30 36 42 ... etc" are the "simulated" datapoints (one per simulated minute) for the metric, then you ask the test to evaluate 'is rate(..[5m]) > 0 true for more than 10m over these datapoints?' and check that the alert fired
[10:36:17] hope that makes sense!
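(Annotation, not part of the channel log.) The explanation above can be illustrated with a minimal Python sketch. It is not the actual test from the alerts repository and does not call Prometheus or promtool; the function names `expand_series`, `simple_rate` and `alert_fires` are hypothetical, the `6+6x20` series is an assumed input, and the rate calculation is a deliberate simplification (plain increase divided by elapsed time, with no extrapolation or counter-reset handling). It only shows how the simulated datapoints relate to the question "is rate(..[5m]) > 0 true for 10m?".

```python
import re


def expand_series(notation: str) -> list[float]:
    """Expand 'a+bxn' shorthand into explicit samples.

    'a+bxn' yields n+1 samples: a, a+b, ..., a+n*b (one per simulated minute).
    Plain numbers pass through unchanged. Other tokens (such as '_' for a
    missing sample) are not handled in this sketch.
    """
    pattern = re.compile(r"(-?\d+(?:\.\d+)?)([+-])(\d+(?:\.\d+)?)x(\d+)")
    samples: list[float] = []
    for token in notation.split():
        match = pattern.fullmatch(token)
        if match is None:
            samples.append(float(token))
            continue
        start, sign, step, count = match.groups()
        delta = float(step) if sign == "+" else -float(step)
        samples.extend(float(start) + delta * i for i in range(int(count) + 1))
    return samples


def simple_rate(samples: list[float], at_minute: int, window_minutes: int = 5):
    """Simplified per-second rate over the trailing window.

    Assumption: one sample per minute, index == minute. Returns None when
    fewer than two samples fall inside the window.
    """
    if at_minute >= len(samples):
        return None
    window = samples[max(0, at_minute - window_minutes): at_minute + 1]
    if len(window) < 2:
        return None
    return (window[-1] - window[0]) / ((len(window) - 1) * 60)


def alert_fires(samples: list[float], eval_minute: int, for_minutes: int = 10) -> bool:
    """True if the simplified rate was > 0 at every minute of the hold period."""
    start = eval_minute - for_minutes
    if start < 0:
        return False
    for minute in range(start, eval_minute + 1):
        rate = simple_rate(samples, minute)
        if rate is None or rate <= 0:
            return False
    return True
```

With these helpers, `expand_series("6+6x6")` gives the seven simulated datapoints quoted in the discussion, and `alert_fires(expand_series("6+6x20"), 11)` asks whether a steadily increasing counter would have kept the condition true for the whole hold period. A flat series such as `0+0x20` never satisfies `rate > 0`, so the sketch reports that the alert would not fire.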
[10:38:08] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] "PCC NOOP https://puppet-compiler.wmflabs.org/compiler1001/32948/" [puppet] - 10https://gerrit.wikimedia.org/r/745768 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [10:39:08] godog: thank you! that's exactly what I was missing. [10:39:10] appreciate it. [10:39:32] sure np! glad it makes sense now [10:39:42] so the values are simulated values that would make it go off. [10:40:08] yeah exactly [10:40:23] yeah all good I knew it was something simple. The Prometheus project's docs don't really explain, tbh I've not done much proper coding and unit testing in my career so probably it's obvious if you have. [10:40:32] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/745612 (https://phabricator.wikimedia.org/T286898) (owner: 10JHathaway) [10:41:53] could be yeah! I agree the prometheus docs on unit testing alerts are a bit thin [10:42:32] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! 
feel free to merge anytime, alerts will get auto deployed" [alerts] - 10https://gerrit.wikimedia.org/r/745563 (https://phabricator.wikimedia.org/T297397) (owner: 10Cathal Mooney) [10:44:19] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] "PCC diff as expected: https://puppet-compiler.wmflabs.org/compiler1001/32949/" [puppet] - 10https://gerrit.wikimedia.org/r/745765 (https://phabricator.wikimedia.org/T292546) (owner: 10Arturo Borrero Gonzalez) [10:50:28] (03PS1) 10Vgutierrez: cache: Provide a Envoy upload role [puppet] - 10https://gerrit.wikimedia.org/r/745772 (https://phabricator.wikimedia.org/T271421) [10:52:56] (03CR) 10Cathal Mooney: [C: 03+2] Adjust PingOffloadMissingIP alert so it recovers properly [alerts] - 10https://gerrit.wikimedia.org/r/745563 (https://phabricator.wikimedia.org/T297397) (owner: 10Cathal Mooney) [10:53:16] (PingOffloadMissingIP) firing: Target IP missing on ping2002:9100 loopback in codfw - https://wikitech.wikimedia.org/wiki/Ping_offload#InAddrErrors_alert - https://grafana.wikimedia.org/d/000000513/ping-offload - https://alerts.wikimedia.org [10:56:11] (03CR) 10Ladsgroup: wdqs: switch GUI deployment from latest to present (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/745634 (https://phabricator.wikimedia.org/T218900) (owner: 10Dzahn) [10:59:36] (03CR) 10Jelto: "looks mostly good for me but I think we have some problem with dependencies here. 
Namespaces helmfile will create some certificates (thus " [deployment-charts] - 10https://gerrit.wikimedia.org/r/737975 (https://phabricator.wikimedia.org/T295385) (owner: 10JMeybohm) [11:00:13] (03CR) 10Jelto: [C: 03+1] admin_ng: Explicitly set dependency to namespaces in cert-manager (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/745760 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [11:06:50] (03PS1) 10Arturo Borrero Gonzalez: ceph: auth: eqiad: fix rgw client name [puppet] - 10https://gerrit.wikimedia.org/r/745779 (https://phabricator.wikimedia.org/T293752) [11:08:11] (03PS2) 10Arturo Borrero Gonzalez: ceph: auth: eqiad: fix rgw client name [puppet] - 10https://gerrit.wikimedia.org/r/745779 (https://phabricator.wikimedia.org/T293752) [11:10:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] ceph: auth: eqiad: fix rgw client name [puppet] - 10https://gerrit.wikimedia.org/r/745779 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [11:13:01] (PingOffloadMissingIP) resolved: Target IP missing on ping2002:9100 loopback in codfw - https://wikitech.wikimedia.org/wiki/Ping_offload#InAddrErrors_alert - https://grafana.wikimedia.org/d/000000513/ping-offload - https://alerts.wikimedia.org [11:26:07] (03Abandoned) 10Hnowlan: maps: disable sync on maps2009 [puppet] - 10https://gerrit.wikimedia.org/r/715754 (owner: 10Hnowlan) [11:40:52] (03PS1) 10Giuseppe Lavagetto: mediawiki: add local logging in rsyslog debug mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/745800 [11:45:16] (03CR) 10ZPapierski: sre.wdqs: Integrate wcqs with wdqs cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/745629 (https://phabricator.wikimedia.org/T293638) (owner: 10Ebernhardson) [11:46:05] (03PS2) 10Muehlenhoff: Enable ganeti2025 as ganeti server [puppet] - 10https://gerrit.wikimedia.org/r/745519 (https://phabricator.wikimedia.org/T282603) [11:48:50] RECOVERY - Ensure hosts are not performing a change on 
every puppet run on cumin1001 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [11:51:43] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: add local logging in rsyslog debug mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/745800 (owner: 10Giuseppe Lavagetto) [11:53:44] (03PS3) 10Hnowlan: restbase: add new hosts restbase202[456] [puppet] - 10https://gerrit.wikimedia.org/r/745501 (https://phabricator.wikimedia.org/T297282) [11:54:45] (03CR) 10Hnowlan: [C: 03+2] restbase: add new hosts restbase202[456] [puppet] - 10https://gerrit.wikimedia.org/r/745501 (https://phabricator.wikimedia.org/T297282) (owner: 10Hnowlan) [12:02:47] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2024.codfw.wmnet with OS buster [12:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:37] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:49] (03PS1) 10Jbond: O:puppet_compiler: convert dependencies to python3 [puppet] - 10https://gerrit.wikimedia.org/r/745803 [12:06:35] (03PS2) 10Jbond: O:puppet_compiler: convert dependencies to python3 [puppet] - 10https://gerrit.wikimedia.org/r/745803 [12:06:40] (03CR) 10Jbond: [V: 03+2 C: 03+2] O:puppet_compiler: convert dependencies to python3 [puppet] - 10https://gerrit.wikimedia.org/r/745803 (owner: 10Jbond) [12:08:24] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:24] (03PS1) 10Esanders: Disable DT mobile on enwiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745819 (https://phabricator.wikimedia.org/T295816) [12:26:32] (03PS1) 10Jbond: puppet-dev - hiera: add horizon config [puppet] - 10https://gerrit.wikimedia.org/r/745820 [12:26:49] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppet-dev - hiera: add horizon config [puppet] - 10https://gerrit.wikimedia.org/r/745820 (owner: 10Jbond) [12:28:09] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2025.codfw.wmnet with OS buster [12:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:57] <_joe_> !log manually modifying configmaps for rsyslog in mwdebug for live troubleshooting. [12:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:10] !log including cassandra-tools in cassandra311 component of buster-wikimedia [12:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:23] (03PS1) 10Giuseppe Lavagetto: mediawiki: fix slowlog rule [deployment-charts] - 10https://gerrit.wikimedia.org/r/745831 [12:45:30] (03PS1) 10Kosta Harlan: wgEventStreams: Add WelcomeSurvey Interaction schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745833 (https://phabricator.wikimedia.org/T267273) [12:45:41] (03PS2) 10Kosta Harlan: wgEventStreams: Add WelcomeSurvey Interaction schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745833 (https://phabricator.wikimedia.org/T267273) [12:48:00] (03CR) 10JMeybohm: admin_ng: Explicitly set dependency to namespaces in cert-manager (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/745760 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [12:51:21] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: puppetdb postgress server: fix dependcey loop - https://phabricator.wikimedia.org/T296550 
(10jbond) [12:51:38] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2024.codfw.wmnet with OS buster [12:51:40] (03CR) 10JMeybohm: [C: 04-1] admin_ng: Create Certificates for ingressgateway (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/737975 (https://phabricator.wikimedia.org/T295385) (owner: 10JMeybohm) [12:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:19] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2026.codfw.wmnet with OS buster [12:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:00] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on restbase[2024-2025].codfw.wmnet with reason: New cassandra hosts awaiting syncing [12:56:02] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on restbase[2024-2025].codfw.wmnet with reason: New cassandra hosts awaiting syncing [12:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:50] (03PS3) 10Muehlenhoff: Enable ganeti2025 as ganeti server [puppet] - 10https://gerrit.wikimedia.org/r/745519 (https://phabricator.wikimedia.org/T282603) [13:00:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1127.eqiad.wmnet with reason: Maintenance [13:00:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1127.eqiad.wmnet with reason: Maintenance [13:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T277354)', diff saved to https://phabricator.wikimedia.org/P18098 and previous config saved 
to /var/cache/conftool/dbconfig/20211210-130051-marostegui.json [13:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:56] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [13:02:21] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Explicitly set dependency to namespaces in cert-manager [deployment-charts] - 10https://gerrit.wikimedia.org/r/745760 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [13:02:25] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Don't let helm create namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/745762 (owner: 10JMeybohm) [13:02:28] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Bump cert-manager image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/745763 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [13:03:07] (03CR) 10Jbond: [C: 03+1] mgmt: delete the entire module and role::mgmt::drac_ilo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/744874 (https://phabricator.wikimedia.org/T272559) (owner: 10Dzahn) [13:04:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T277354)', diff saved to https://phabricator.wikimedia.org/P18099 and previous config saved to /var/cache/conftool/dbconfig/20211210-130427-marostegui.json [13:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:16] (03PS2) 10Jbond: WIP - puppetmaster: add upload job to puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/745599 [13:05:51] (03CR) 10jerkins-bot: [V: 04-1] WIP - puppetmaster: add upload job to puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/745599 (owner: 10Jbond) [13:05:53] (03Merged) 10jenkins-bot: admin_ng: Explicitly set dependency to namespaces in cert-manager [deployment-charts] - 10https://gerrit.wikimedia.org/r/745760 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [13:06:23] (03Merged) 
10jenkins-bot: admin_ng: Don't let helm create namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/745762 (owner: 10JMeybohm) [13:06:25] (03Merged) 10jenkins-bot: admin_ng: Bump cert-manager image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/745763 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [13:17:04] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2025.codfw.wmnet with OS buster [13:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P18100 and previous config saved to /var/cache/conftool/dbconfig/20211210-131932-marostegui.json [13:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:17] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200): / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [13:25:25] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [13:26:27] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [13:27:53] RECOVERY - Ensure hosts are not performing a change on every puppet run on cumin2001 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [13:29:46] (03PS1) 10Ladsgroup: auto_schema: Add README pointing to wikitech page [software] - 10https://gerrit.wikimedia.org/r/745837 (https://phabricator.wikimedia.org/T288235) [13:30:53] (03PS1) 10Filippo Giunchedi: graphite: 
backup 'daily' hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/745838 (https://phabricator.wikimedia.org/T294355) [13:31:39] (03CR) 10Marostegui: [C: 03+1] auto_schema: Add README pointing to wikitech page [software] - 10https://gerrit.wikimedia.org/r/745837 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [13:31:58] (03PS2) 10Ladsgroup: auto_schema: Add README pointing to wikitech page [software] - 10https://gerrit.wikimedia.org/r/745837 (https://phabricator.wikimedia.org/T288235) [13:32:01] (03CR) 10Ladsgroup: [C: 03+2] auto_schema: Add README pointing to wikitech page [software] - 10https://gerrit.wikimedia.org/r/745837 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [13:32:33] (03Merged) 10jenkins-bot: auto_schema: Add README pointing to wikitech page [software] - 10https://gerrit.wikimedia.org/r/745837 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [13:34:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P18101 and previous config saved to /var/cache/conftool/dbconfig/20211210-133437-marostegui.json [13:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:53] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.21% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [13:41:09] (03PS1) 10Bartosz Dziewoński: Fix PageRecord lookup [extensions/DiscussionTools] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745652 (https://phabricator.wikimedia.org/T297431) [13:46:22] (03CR) 10Jelto: [C: 03+2] admin_ng: remove tiller [deployment-charts] - 10https://gerrit.wikimedia.org/r/742989 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [13:48:20] (03PS1) 10Jbond: pcc: add default args [puppet] - 
10https://gerrit.wikimedia.org/r/745841 [13:48:22] (03PS1) 10Jbond: puppetmaster::scripts: add upload script [puppet] - 10https://gerrit.wikimedia.org/r/745842 [13:49:23] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster::scripts: add upload script [puppet] - 10https://gerrit.wikimedia.org/r/745842 (owner: 10Jbond) [13:49:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T277354)', diff saved to https://phabricator.wikimedia.org/P18102 and previous config saved to /var/cache/conftool/dbconfig/20211210-134941-marostegui.json [13:49:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db[1155,1158].eqiad.wmnet with reason: Maintenance [13:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:47] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [13:49:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db[1155,1158].eqiad.wmnet with reason: Maintenance [13:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T277354)', diff saved to https://phabricator.wikimedia.org/P18103 and previous config saved to /var/cache/conftool/dbconfig/20211210-134953-marostegui.json [13:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:45] (03PS2) 10Jbond: pcc: add default args [puppet] - 10https://gerrit.wikimedia.org/r/745841 [13:51:00] (03PS2) 10Jbond: puppetmaster::scripts: add upload script [puppet] - 10https://gerrit.wikimedia.org/r/745842 [13:51:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T277354)', diff saved to 
https://phabricator.wikimedia.org/P18104 and previous config saved to /var/cache/conftool/dbconfig/20211210-135114-marostegui.json [13:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:43] (03PS3) 10Jbond: WIP - puppetmaster: add upload job to puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/745599 [13:51:45] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster::scripts: add upload script [puppet] - 10https://gerrit.wikimedia.org/r/745842 (owner: 10Jbond) [13:51:59] (03Abandoned) 10Jbond: puppetmaster::scripts: add upload script [puppet] - 10https://gerrit.wikimedia.org/r/745842 (owner: 10Jbond) [13:52:24] (03CR) 10jerkins-bot: [V: 04-1] WIP - puppetmaster: add upload job to puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/745599 (owner: 10Jbond) [13:55:10] (03PS4) 10Jbond: WIP - puppetmaster: add upload job to puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/745599 [13:55:27] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200): / (spec from root) timed out before a response was received: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [13:55:46] (03CR) 10jerkins-bot: [V: 04-1] WIP - puppetmaster: add upload job to puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/745599 (owner: 10Jbond) [13:56:16] Minor note, AWS is having some issues. 
(https://news.ycombinator.com/item?id=29509629) [13:57:44] I don't think we can do anything to that [13:58:37] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [14:01:13] !log increase backup2004's allocated disk space [14:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P18105 and previous config saved to /var/cache/conftool/dbconfig/20211210-140618-marostegui.json [14:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:00] (03PS1) 10Jbond: C:tomcat: manage /etc/default/tomcat9 [puppet] - 10https://gerrit.wikimedia.org/r/745846 (https://phabricator.wikimedia.org/T297468) [14:07:38] (03CR) 10jerkins-bot: [V: 04-1] C:tomcat: manage /etc/default/tomcat9 [puppet] - 10https://gerrit.wikimedia.org/r/745846 (https://phabricator.wikimedia.org/T297468) (owner: 10Jbond) [14:07:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32952/console" [puppet] - 10https://gerrit.wikimedia.org/r/745846 (https://phabricator.wikimedia.org/T297468) (owner: 10Jbond) [14:09:19] (03PS2) 10Jbond: C:tomcat: manage /etc/default/tomcat9 [puppet] - 10https://gerrit.wikimedia.org/r/745846 (https://phabricator.wikimedia.org/T297468) [14:09:55] (03CR) 10jerkins-bot: [V: 04-1] C:tomcat: manage /etc/default/tomcat9 [puppet] - 10https://gerrit.wikimedia.org/r/745846 (https://phabricator.wikimedia.org/T297468) (owner: 10Jbond) [14:10:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32953/console" [puppet] - 10https://gerrit.wikimedia.org/r/745846 (https://phabricator.wikimedia.org/T297468) (owner: 10Jbond) [14:12:52] (03CR) 10Muehlenhoff: [C: 03+1] "Looks 
good!" [puppet] - 10https://gerrit.wikimedia.org/r/745846 (https://phabricator.wikimedia.org/T297468) (owner: 10Jbond) [14:13:04] (03PS3) 10Jbond: C:tomcat: manage /etc/default/tomcat9 [puppet] - 10https://gerrit.wikimedia.org/r/745846 (https://phabricator.wikimedia.org/T297468) [14:13:59] (03CR) 10Jbond: [C: 03+2] C:tomcat: manage /etc/default/tomcat9 [puppet] - 10https://gerrit.wikimedia.org/r/745846 (https://phabricator.wikimedia.org/T297468) (owner: 10Jbond) [14:17:06] !log jelto@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:11] !log jelto@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:57] (03PS2) 10Jelto: admin_ng: remove tiller [deployment-charts] - 10https://gerrit.wikimedia.org/r/742989 (https://phabricator.wikimedia.org/T251305) [14:19:43] (03PS1) 10Muehlenhoff: Failover idp CNAME to idp2001 [dns] - 10https://gerrit.wikimedia.org/r/745850 [14:21:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P18106 and previous config saved to /var/cache/conftool/dbconfig/20211210-142123-marostegui.json [14:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:41] (03CR) 10Jelto: [C: 03+2] admin_ng: remove tiller [deployment-charts] - 10https://gerrit.wikimedia.org/r/742989 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [14:26:01] (03Merged) 10jenkins-bot: admin_ng: remove tiller [deployment-charts] - 10https://gerrit.wikimedia.org/r/742989 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [14:27:27] !log jelto@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[14:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:12] !log jelto@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:15] !log remove tiller from staging-codfw Kubernetes cluster [14:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:27] jelto: \o/ [14:30:51] (03PS2) 10Kormat: wmfdb/section: Add class for handling of sections. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/745249 [14:31:15] (03PS1) 10Kormat: wmfdb/addr: Add addr.py to handle addresses. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/745852 [14:33:18] !log jelto@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:33:19] !log jelto@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:50] !log jelto@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:57] !log jelto@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[14:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:23] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2026.codfw.wmnet with OS buster [14:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T277354)', diff saved to https://phabricator.wikimedia.org/P18107 and previous config saved to /var/cache/conftool/dbconfig/20211210-143628-marostegui.json [14:36:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1174.eqiad.wmnet with reason: Maintenance [14:36:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1174.eqiad.wmnet with reason: Maintenance [14:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:33] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [14:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T277354)', diff saved to https://phabricator.wikimedia.org/P18108 and previous config saved to /var/cache/conftool/dbconfig/20211210-143636-marostegui.json [14:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T277354)', diff saved to https://phabricator.wikimedia.org/P18109 and previous config saved to /var/cache/conftool/dbconfig/20211210-143856-marostegui.json [14:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:06] (03PS1) 10Kormat: wmfdb/cli_admin: Add db_mysql [software/wmfdb] - 
10https://gerrit.wikimedia.org/r/745857 [14:48:02] !log remove tiller from staging-eqiad Kubernetes cluster [14:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:08] !log jelto@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [14:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:25] !log jelto@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [14:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:44] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti2008.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [14:48:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti2008.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [14:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:12] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on restbase2026.codfw.wmnet with reason: New cassandra hosts awaiting syncing [14:49:14] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on restbase2026.codfw.wmnet with reason: New cassandra hosts awaiting syncing [14:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:30] !log jelto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . 
[14:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:26] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) [14:52:42] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) One more; ganeti2008. Ready to be powered off any time. [14:54:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P18110 and previous config saved to /var/cache/conftool/dbconfig/20211210-145401-marostegui.json [14:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:09] !log drain primary/secondary instances off ganeti2017 T296622 [14:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:13] T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 [15:01:05] !log jelto@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [15:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:24] !log remove tiller from codfw Kubernetes cluster [15:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:59] !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [15:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:34] !log increase backup2005's allocated disk space [15:06:34] (03CR) 10David Caro: "Looks good, just a question about the new __init__ files in the tests directory. The rest are just nits, feel free to ignore." 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/739111 (owner: 10Majavah) [15:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:12] !log jelto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [15:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P18111 and previous config saved to /var/cache/conftool/dbconfig/20211210-150906-marostegui.json [15:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:11] (03PS1) 10David Caro: Review access change [software/puppet-compiler] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/745868 [15:14:39] (03Abandoned) 10David Caro: Review access change [software/puppet-compiler] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/745377 (owner: 10David Caro) [15:15:39] !log remove tiller from eqiad Kubernetes cluster [15:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:45] !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:01] !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [15:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:01] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' . 
[15:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:56] (03PS1) 10David Caro: pcc: use the new compiler name [puppet] - 10https://gerrit.wikimedia.org/r/745864 [15:24:00] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32954/console" [puppet] - 10https://gerrit.wikimedia.org/r/745864 (owner: 10David Caro) [15:24:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T277354)', diff saved to https://phabricator.wikimedia.org/P18112 and previous config saved to /var/cache/conftool/dbconfig/20211210-152410-marostegui.json [15:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:16] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [15:25:16] (03PS1) 10Jbond: P:puppet_compiler: add uploader proxy [puppet] - 10https://gerrit.wikimedia.org/r/745865 [15:25:51] (03CR) 10jerkins-bot: [V: 04-1] P:puppet_compiler: add uploader proxy [puppet] - 10https://gerrit.wikimedia.org/r/745865 (owner: 10Jbond) [15:26:52] (03PS2) 10Jbond: P:puppet_compiler: add uploader proxy [puppet] - 10https://gerrit.wikimedia.org/r/745865 [15:27:02] (03CR) 10Jbond: "FYI" [puppet] - 10https://gerrit.wikimedia.org/r/745865 (owner: 10Jbond) [15:27:31] (03CR) 10jerkins-bot: [V: 04-1] P:puppet_compiler: add uploader proxy [puppet] - 10https://gerrit.wikimedia.org/r/745865 (owner: 10Jbond) [15:27:45] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) First cleanup task is finished: [x] remove tiller and tiller service accounts ([742989](https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/742989)) Tiller deploymen... 
[15:27:59] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) [15:29:52] (03PS2) 10David Caro: pcc: use the new compiler name [puppet] - 10https://gerrit.wikimedia.org/r/745864 [15:29:54] (03PS1) 10David Caro: puppet-diffs: use canonical cloud domain [puppet] - 10https://gerrit.wikimedia.org/r/745886 [15:30:00] (03PS3) 10Jbond: P:puppet_compiler: add uploader proxy [puppet] - 10https://gerrit.wikimedia.org/r/745865 [15:30:39] (03CR) 10David Caro: P:puppet_compiler: add uploader proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/745865 (owner: 10Jbond) [15:30:44] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/puppet-compiler] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/745868 (owner: 10David Caro) [15:30:50] (03CR) 10jerkins-bot: [V: 04-1] P:puppet_compiler: add uploader proxy [puppet] - 10https://gerrit.wikimedia.org/r/745865 (owner: 10Jbond) [15:31:19] (03PS4) 10Majavah: Check for start npm script [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/739111 [15:31:26] (03CR) 10Jbond: [C: 03+1] pcc: use the new compiler name [puppet] - 10https://gerrit.wikimedia.org/r/745864 (owner: 10David Caro) [15:31:40] (03PS1) 10Jgreen: nsca_frack.cfg.erb monitor mail queue on frdev1001 [puppet] - 10https://gerrit.wikimedia.org/r/745887 (https://phabricator.wikimedia.org/T297304) [15:32:08] (03CR) 10Majavah: Check for start npm script (034 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/739111 (owner: 10Majavah) [15:32:17] (03CR) 10Jbond: [C: 03+2] pcc: add default args [puppet] - 10https://gerrit.wikimedia.org/r/745841 (owner: 10Jbond) [15:33:37] (03CR) 10Jgreen: [C: 03+2] nsca_frack.cfg.erb monitor mail queue on frdev1001 [puppet] - 10https://gerrit.wikimedia.org/r/745887 (https://phabricator.wikimedia.org/T297304) (owner: 10Jgreen) [15:34:24] (03CR) 10David Caro: [C: 03+1] "+💯!" 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/739111 (owner: 10Majavah) [15:35:09] jbond: I was just about to puppet-merge and picked up your commit, ok to deploy it? [15:36:05] (03PS4) 10Jbond: P:puppet_compiler: add uploader proxy [puppet] - 10https://gerrit.wikimedia.org/r/745865 [15:37:42] (03CR) 10David Caro: [C: 04-1] puppet-diffs: use canonical cloud domain (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/745886 (owner: 10David Caro) [15:37:45] (03CR) 10jerkins-bot: [V: 04-1] P:puppet_compiler: add uploader proxy [puppet] - 10https://gerrit.wikimedia.org/r/745865 (owner: 10Jbond) [15:37:56] (03PS5) 10Jbond: P:puppet_compiler: add uploader proxy [puppet] - 10https://gerrit.wikimedia.org/r/745865 [15:37:59] (03CR) 10Jbond: P:puppet_compiler: add uploader proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/745865 (owner: 10Jbond) [15:39:34] (03CR) 10jerkins-bot: [V: 04-1] P:puppet_compiler: add uploader proxy [puppet] - 10https://gerrit.wikimedia.org/r/745865 (owner: 10Jbond) [15:40:30] (03PS6) 10Jbond: P:puppet_compiler: add uploader proxy [puppet] - 10https://gerrit.wikimedia.org/r/745865 [15:40:59] PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:41:08] (03CR) 10David Caro: [C: 03+2] Review access change [software/puppet-compiler] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/745868 (owner: 10David Caro) [15:41:11] (03CR) 10David Caro: [V: 03+2 C: 03+2] Review access change [software/puppet-compiler] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/745868 (owner: 10David Caro) [15:41:29] (03CR) 10David Caro: [V: 03+2 C: 03+2] Review access change (031 comment) [software/puppet-compiler] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/745868 (owner: 10David Caro) [15:42:21] (03CR) 10jerkins-bot: [V: 04-1] P:puppet_compiler: add uploader proxy 
[puppet] - 10https://gerrit.wikimedia.org/r/745865 (owner: 10Jbond) [15:42:23] (03PS2) 10David Caro: puppet-diffs: use canonical cloud domain and new compiler name [puppet] - 10https://gerrit.wikimedia.org/r/745886 [15:42:25] (03PS3) 10David Caro: pcc: use the new compiler name [puppet] - 10https://gerrit.wikimedia.org/r/745864 [15:42:27] (03CR) 10David Caro: [C: 04-1] puppet-diffs: use canonical cloud domain and new compiler name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/745886 (owner: 10David Caro) [15:43:13] (03CR) 10David Caro: [V: 03+1] "Tested this directly on pcc-db1001 successfully" [puppet] - 10https://gerrit.wikimedia.org/r/745886 (owner: 10David Caro) [15:44:54] (03PS7) 10Jbond: P:puppet_compiler: add uploader proxy [puppet] - 10https://gerrit.wikimedia.org/r/745865 [15:45:03] (03Abandoned) 10David Caro: puppet-diffs: use canonical cloud domain and new compiler name [puppet] - 10https://gerrit.wikimedia.org/r/745886 (owner: 10David Caro) [15:46:58] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/745865 (owner: 10Jbond) [15:47:42] (03CR) 10Majavah: [C: 03+2] Check for start npm script [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/739111 (owner: 10Majavah) [15:48:06] (03CR) 10Jbond: [C: 03+2] P:puppet_compiler: add uploader proxy [puppet] - 10https://gerrit.wikimedia.org/r/745865 (owner: 10Jbond) [15:48:57] Jeff_Green: ok to merge your change [15:48:57] (03Merged) 10jenkins-bot: Check for start npm script [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/739111 (owner: 10Majavah) [15:49:08] jbond: yes [15:49:16] cool, merged [15:49:20] thx [15:49:29] np :) [15:50:58] (03CR) 10Jbond: Review access change (031 comment) [software/puppet-compiler] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/745868 (owner: 10David Caro) [15:52:05] RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:54:17] !log increase backup2006's allocated disk space [15:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:24] (03PS1) 10Jbond: puppet_compiler: correct template location [puppet] - 10https://gerrit.wikimedia.org/r/745892 [15:58:36] (03CR) 10Bartosz Dziewoński: [C: 03+1] Disable DT mobile on enwiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745819 (https://phabricator.wikimedia.org/T295816) (owner: 10Esanders) [15:59:38] anyone around who could +2 a beta-only config change for me on this fine friday? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/745819 [15:59:56] (or will you make me wait until next week? ;) ) [16:00:05] * Reedy looks [16:00:05] (03CR) 10Jbond: [C: 03+2] puppet_compiler: correct template location [puppet] - 10https://gerrit.wikimedia.org/r/745892 (owner: 10Jbond) [16:01:06] (03CR) 10Reedy: [C: 03+2] Disable DT mobile on enwiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745819 (https://phabricator.wikimedia.org/T295816) (owner: 10Esanders) [16:01:13] (03PS1) 10JMeybohm: admin_ng: Add cert-manager networkpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/745893 (https://phabricator.wikimedia.org/T294560) [16:01:15] (03PS1) 10JMeybohm: cert-manager/cfssl allow override of KUBERNETES_SERVICE envs [deployment-charts] - 10https://gerrit.wikimedia.org/r/745894 (https://phabricator.wikimedia.org/T294560) [16:02:02] (03Merged) 10jenkins-bot: Disable DT mobile on enwiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745819 (https://phabricator.wikimedia.org/T295816) (owner: 10Esanders) [16:02:21] thanks Reedy [16:05:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[16:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:35] (03CR) 10David Caro: [C: 03+2] pcc: use the new compiler name [puppet] - 10https://gerrit.wikimedia.org/r/745864 (owner: 10David Caro) [16:07:48] (03PS4) 10David Caro: pcc: use the new compiler name [puppet] - 10https://gerrit.wikimedia.org/r/745864 [16:17:22] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Add cert-manager networkpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/745893 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [16:17:25] (03CR) 10JMeybohm: [C: 03+2] cert-manager/cfssl allow override of KUBERNETES_SERVICE envs [deployment-charts] - 10https://gerrit.wikimedia.org/r/745894 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [16:19:03] (03PS1) 10Cwhite: opensearch: manage log4j2 msg formatting [puppet] - 10https://gerrit.wikimedia.org/r/745900 (https://phabricator.wikimedia.org/T297468) [16:21:03] (03Merged) 10jenkins-bot: admin_ng: Add cert-manager networkpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/745893 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [16:21:20] (03Merged) 10jenkins-bot: cert-manager/cfssl allow override of KUBERNETES_SERVICE envs [deployment-charts] - 10https://gerrit.wikimedia.org/r/745894 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [16:23:19] (03PS1) 10Cwhite: elasticsearch: manage log4j2 msg formatting [puppet] - 10https://gerrit.wikimedia.org/r/745901 (https://phabricator.wikimedia.org/T297468) [16:23:34] (03CR) 10Jcrespo: [C: 04-1] "As I suspected, current graphite model in use doesn't work well together with bacula. 
Bacula is able to produce both differential and incr" [puppet] - 10https://gerrit.wikimedia.org/r/745838 (https://phabricator.wikimedia.org/T294355) (owner: 10Filippo Giunchedi) [16:25:24] (03PS1) 10Dzahn: wikistats: add /usr/local/bin/wikistats/ to PATH for all users [puppet] - 10https://gerrit.wikimedia.org/r/745902 [16:26:10] (03CR) 10jerkins-bot: [V: 04-1] wikistats: add /usr/local/bin/wikistats/ to PATH for all users [puppet] - 10https://gerrit.wikimedia.org/r/745902 (owner: 10Dzahn) [16:26:13] (03PS1) 10Jbond: puppet_compiler: Rewrite the URL to strip upload [puppet] - 10https://gerrit.wikimedia.org/r/745903 [16:28:09] (03PS1) 10Cwhite: logstash: manage log4j2 msg formatting [puppet] - 10https://gerrit.wikimedia.org/r/745905 (https://phabricator.wikimedia.org/T297468) [16:29:49] (03CR) 10Filippo Giunchedi: [C: 03+1] opensearch: manage log4j2 msg formatting [puppet] - 10https://gerrit.wikimedia.org/r/745900 (https://phabricator.wikimedia.org/T297468) (owner: 10Cwhite) [16:30:04] (03CR) 10Filippo Giunchedi: [C: 03+1] elasticsearch: manage log4j2 msg formatting [puppet] - 10https://gerrit.wikimedia.org/r/745901 (https://phabricator.wikimedia.org/T297468) (owner: 10Cwhite) [16:30:54] (03CR) 10Herron: [C: 03+1] opensearch: manage log4j2 msg formatting [puppet] - 10https://gerrit.wikimedia.org/r/745900 (https://phabricator.wikimedia.org/T297468) (owner: 10Cwhite) [16:31:15] (03CR) 10Herron: [C: 03+1] elasticsearch: manage log4j2 msg formatting [puppet] - 10https://gerrit.wikimedia.org/r/745901 (https://phabricator.wikimedia.org/T297468) (owner: 10Cwhite) [16:31:57] (03PS2) 10Dzahn: wikistats: add /usr/local/bin/wikistats/ to PATH for all users [puppet] - 10https://gerrit.wikimedia.org/r/745902 [16:32:27] dancy: if you're around, i think https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/745652 (wmf.12 bug fix) could be deployed [16:32:37] (03CR) 10jerkins-bot: [V: 04-1] wikistats: add /usr/local/bin/wikistats/ to PATH for all users [puppet] - 
10https://gerrit.wikimedia.org/r/745902 (owner: 10Dzahn) [16:33:31] MatmaRex: Excellent. [16:33:51] I will do so now. [16:34:20] (03CR) 10Jbond: [C: 03+2] puppet_compiler: Rewrite the URL to strip upload [puppet] - 10https://gerrit.wikimedia.org/r/745903 (owner: 10Jbond) [16:35:33] dancy: actually, while i have your attention, would you be willing to also backport the fix for https://phabricator.wikimedia.org/T297418 ? (not new in wmf.12, but it was only reported yesterday) [16:36:23] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/745900 (https://phabricator.wikimedia.org/T297468) (owner: 10Cwhite) [16:36:35] MaxmaRex: Sure [16:36:37] (03PS1) 10JMeybohm: cfssl-issuer: Fix image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/745907 (https://phabricator.wikimedia.org/T294560) [16:36:56] MatmaRex: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/745652 is on mwdebug1001. Can you test it? [16:36:56] (03PS1) 10Btullis: Update the SSH public key for Olja Dimitrievic [puppet] - 10https://gerrit.wikimedia.org/r/745908 (https://phabricator.wikimedia.org/T282836) [16:36:58] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] cfssl-issuer: Fix image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/745907 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [16:37:12] (03CR) 10Herron: [C: 03+1] prometheus: add blackbox/discovery jobs [puppet] - 10https://gerrit.wikimedia.org/r/743979 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [16:37:29] looking [16:37:34] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, but will need sync up with the main search cluster :-)" [puppet] - 10https://gerrit.wikimedia.org/r/745901 (https://phabricator.wikimedia.org/T297468) (owner: 10Cwhite) [16:37:46] (03PS1) 10Bartosz Dziewoński: ve.ui.MWReferencesListDialog: Fix exception caused by a copy-paste mistake [extensions/Cite] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745872 
(https://phabricator.wikimedia.org/T297418) [16:38:11] (03CR) 10Herron: [C: 03+1] prometheus: add alerts for network probes [puppet] - 10https://gerrit.wikimedia.org/r/743980 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [16:38:38] (03CR) 10Cwhite: [C: 03+2] opensearch: manage log4j2 msg formatting [puppet] - 10https://gerrit.wikimedia.org/r/745900 (https://phabricator.wikimedia.org/T297468) (owner: 10Cwhite) [16:39:03] dancy: for the DiscussionTools bug, i don't actually know how to reproduce the error. i guess i can make some edits and confirm that i get a notification [16:39:19] MatmaRex: Sounds like a plan [16:39:24] dancy: but i think the real test will be watching the logs afterwards and seeing if the issue disappears [16:39:29] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/745905 (https://phabricator.wikimedia.org/T297468) (owner: 10Cwhite) [16:40:12] ready for me to roll it out fully then? [16:40:33] (03CR) 10Btullis: [C: 03+2] Update the SSH public key for Olja Dimitrievic [puppet] - 10https://gerrit.wikimedia.org/r/745908 (https://phabricator.wikimedia.org/T282836) (owner: 10Btullis) [16:40:56] (03CR) 10Cwhite: [C: 03+2] logstash: manage log4j2 msg formatting [puppet] - 10https://gerrit.wikimedia.org/r/745905 (https://phabricator.wikimedia.org/T297468) (owner: 10Cwhite) [16:41:03] (03PS2) 10Cwhite: logstash: manage log4j2 msg formatting [puppet] - 10https://gerrit.wikimedia.org/r/745905 (https://phabricator.wikimedia.org/T297468) [16:42:06] dancy: yeah. and fwiw, i got a notification for this edit https://test.wikipedia.org/w/index.php?title=Talk:T297431&diff=493585&oldid=493584 [16:42:07] T297431: InvalidArgumentException: The revision does not belong to the given page. - https://phabricator.wikimedia.org/T297431 [16:42:22] ok.. 
proceeding [16:43:00] the other backport is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Cite/+/745872 [16:43:01] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/745902 (owner: 10Dzahn) [16:43:37] !log dancy@deploy1002 Synchronized php-1.38.0-wmf.12/extensions/DiscussionTools/includes/Notifications/EventDispatcher.php: Backport: [[gerrit:745652|Fix PageRecord lookup (T297431)]] (duration: 00m 58s) [16:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:11] MatmaRex: Starting 745872 [16:45:37] (03PS2) 10Jcrespo: graphite: backup 'daily' hierarchy, with weekly frequency, every Monday [puppet] - 10https://gerrit.wikimedia.org/r/745838 (https://phabricator.wikimedia.org/T294355) (owner: 10Filippo Giunchedi) [16:46:04] MatmaRex: deployed to mwdebug1001. Lemme know when you're ready to proceed. [16:46:49] dancy: looking [16:46:51] (03PS1) 10JMeybohm: cfssl-issuer: Rely on docker entrypoint rather than command in chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/745912 (https://phabricator.wikimedia.org/T294560) [16:47:04] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] cfssl-issuer: Rely on docker entrypoint rather than command in chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/745912 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [16:47:06] dancy: btw, i think you need to merge them both in gerrit as well? [16:47:19] oh cripes. [16:47:31] Hopefully in a few weeks we'll have a single command that does all the stuff. [16:47:47] alright, starting over. [16:48:03] (03CR) 10Jcrespo: "Please have a look at how it looks now. 
Monday (around 4 am UTC) is taken at random- if you have good reasons to do it any other day, we c" [puppet] - 10https://gerrit.wikimedia.org/r/745838 (https://phabricator.wikimedia.org/T294355) (owner: 10Filippo Giunchedi) [16:48:27] (03CR) 10Ahmon Dancy: [C: 03+2] Fix PageRecord lookup [extensions/DiscussionTools] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745652 (https://phabricator.wikimedia.org/T297431) (owner: 10Bartosz Dziewoński) [16:48:43] dancy: i still see the issue when testing with mwdebug1001, i assume it wasn't actually deployed? [16:48:51] right. [16:48:55] Operator error. :-) [16:49:01] yeah, np :) [16:50:46] dancy: i think you could +2 the Cite backport now as well, so that it'll be done faster [16:51:02] (03CR) 10Ahmon Dancy: [C: 03+2] ve.ui.MWReferencesListDialog: Fix exception caused by a copy-paste mistake [extensions/Cite] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745872 (https://phabricator.wikimedia.org/T297418) (owner: 10Bartosz Dziewoński) [16:54:42] (03Merged) 10jenkins-bot: Fix PageRecord lookup [extensions/DiscussionTools] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745652 (https://phabricator.wikimedia.org/T297431) (owner: 10Bartosz Dziewoński) [16:55:40] 10SRE, 10SRE Observability (FY2021/2022-Q2): DX App Synthetic Monitoring App - watchmouse alert flapping due to CA expiration - https://phabricator.wikimedia.org/T292603 (10lmata) @Volans there are no news yet, I have reached out again and am now working on a replacement and will turning off these alerts in fa... [16:56:35] !log increase backup2007's allocated disk space [16:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:39] MatmaRex: Is this dealing with `.12 i/p/ParserOutputAccess:339 The revision does not belong to the given page.` ? 
[16:56:54] !log dancy@deploy1002 Synchronized php-1.38.0-wmf.12/extensions/DiscussionTools/includes/Notifications/EventDispatcher.php: Backport: [[gerrit:745652|Fix PageRecord lookup (T297431)]] (duration: 00m 58s) [16:56:54] yes [16:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:59] T297431: InvalidArgumentException: The revision does not belong to the given page. - https://phabricator.wikimedia.org/T297431 [16:57:00] ok.. really deployed now [16:58:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:53] (03CR) 10Dzahn: ""Could not parse for environment *root*: Illegal variable name, The given name 'PATH' does not conform to the naming rule" yea, not my fau" [puppet] - 10https://gerrit.wikimedia.org/r/745902 (owner: 10Dzahn) [16:59:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[16:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:57] (03PS1) 10Andrew Bogott: Add initial script to manage/automate cinder backups [puppet] - 10https://gerrit.wikimedia.org/r/745917 (https://phabricator.wikimedia.org/T294429) [17:00:12] (03PS2) 10Vgutierrez: cache: Provide a Envoy upload role [puppet] - 10https://gerrit.wikimedia.org/r/745772 (https://phabricator.wikimedia.org/T271421) [17:05:47] (03PS5) 10Jbond: WIP - puppetmaster: add upload job to puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/745599 [17:06:23] (03CR) 10jerkins-bot: [V: 04-1] WIP - puppetmaster: add upload job to puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/745599 (owner: 10Jbond) [17:07:44] (03CR) 10Lucas Werkmeister (WMDE): graphite: backup 'daily' hierarchy, with weekly frequency, every Monday (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/745838 (https://phabricator.wikimedia.org/T294355) (owner: 10Filippo Giunchedi) [17:08:20] (03CR) 10Btullis: [C: 03+1] Pmacct add sflow listener (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/742110 (https://phabricator.wikimedia.org/T263277) (owner: 10Ayounsi) [17:10:14] (03CR) 10Btullis: "I was working directly with the user in Google Meet and watched her generate the new key." 
[puppet] - 10https://gerrit.wikimedia.org/r/745908 (https://phabricator.wikimedia.org/T282836) (owner: 10Btullis) [17:10:51] (03CR) 10Dzahn: wdqs: switch GUI deployment from latest to present (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/745634 (https://phabricator.wikimedia.org/T218900) (owner: 10Dzahn) [17:12:04] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: fix slowlog rule [deployment-charts] - 10https://gerrit.wikimedia.org/r/745831 (owner: 10Giuseppe Lavagetto) [17:13:28] (03Merged) 10jenkins-bot: ve.ui.MWReferencesListDialog: Fix exception caused by a copy-paste mistake [extensions/Cite] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/745872 (https://phabricator.wikimedia.org/T297418) (owner: 10Bartosz Dziewoński) [17:14:53] (03PS6) 10Jbond: WIP - puppetmaster: add upload job to puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/745599 [17:15:24] (03PS7) 10Jbond: puppetmaster: add puppet-facts-upload job to puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/745599 [17:15:52] (03Merged) 10jenkins-bot: mediawiki: fix slowlog rule [deployment-charts] - 10https://gerrit.wikimedia.org/r/745831 (owner: 10Giuseppe Lavagetto) [17:16:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[17:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:59] MatmaRex: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Cite/+/745872 deployed to mwdebug1001 [17:20:11] dancy: thanks, confirmed it fixes the issue [17:20:18] Word [17:21:27] !log dancy@deploy1002 Synchronized php-1.38.0-wmf.12/extensions/Cite/modules/ve-cite/ve.ui.MWReferencesListDialog.js: Backport: [[gerrit:745872|ve.ui.MWReferencesListDialog: Fix exception caused by a copy-paste mistake (T297418)]] (duration: 00m 58s) [17:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:32] T297418: The popup window for editing the list of references freezes - https://phabricator.wikimedia.org/T297418 [17:23:03] 10SRE, 10Infrastructure-Foundations: reimage physical host with new hostname mirror1001 - https://phabricator.wikimedia.org/T297508 (10jhathaway) [17:26:55] thanks dancy [17:27:02] 👍🏾 [17:27:12] Thanks for the fixes! [17:38:06] (03PS1) 10JHathaway: copernicium: decom, in prep for renaming to mirror1001 [puppet] - 10https://gerrit.wikimedia.org/r/745920 [17:38:49] (03CR) 10jerkins-bot: [V: 04-1] copernicium: decom, in prep for renaming to mirror1001 [puppet] - 10https://gerrit.wikimedia.org/r/745920 (owner: 10JHathaway) [17:41:42] (03PS2) 10JHathaway: copernicium: decom, in prep for renaming to mirror1001 [puppet] - 10https://gerrit.wikimedia.org/r/745920 (https://phabricator.wikimedia.org/T297508) [17:42:47] (03PS2) 10Andrew Bogott: Add initial script to manage/automate cinder backups [puppet] - 10https://gerrit.wikimedia.org/r/745917 (https://phabricator.wikimedia.org/T294429) [17:55:29] (03PS1) 10JMeybohm: cfssl-issuer: Fix metrics listen address and probes [deployment-charts] - 10https://gerrit.wikimedia.org/r/745923 (https://phabricator.wikimedia.org/T294560) [17:58:14] (03CR) 10Jbond: [C: 03+1] "LGTM but i wonder if it makes senses to just rename them in this PS, instead of having an additional CR" [puppet] - 
10https://gerrit.wikimedia.org/r/745920 (https://phabricator.wikimedia.org/T297508) (owner: 10JHathaway) [17:58:57] 10SRE, 10API Platform, 10Desktop Improvements, 10MediaWiki-REST-API, and 10 others: Rest API incorrectly publicly caches results from private wikis - https://phabricator.wikimedia.org/T292763 (10Reedy) [18:04:05] !log jhathaway@cumin1001 START - Cookbook sre.hosts.decommission for hosts copernicium.wikimedia.org [18:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:14] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:08:12] (03CR) 10Ebernhardson: [C: 03+1] elasticsearch: manage log4j2 msg formatting [puppet] - 10https://gerrit.wikimedia.org/r/745901 (https://phabricator.wikimedia.org/T297468) (owner: 10Cwhite) [18:10:38] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:11:44] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti2017.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [18:11:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti2017.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [18:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:44] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) [18:12:57] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) One more; ganeti2017. Ready to be powered off any time. 
[18:17:16] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/743254 (owner: 10JHathaway) [18:21:44] (03CR) 10JHathaway: [C: 03+2] icinga: authorize myself, jhathaway, to run commands [puppet] - 10https://gerrit.wikimedia.org/r/743254 (owner: 10JHathaway) [18:21:59] (03PS2) 10Cwhite: elasticsearch: manage log4j2 msg formatting [puppet] - 10https://gerrit.wikimedia.org/r/745901 (https://phabricator.wikimedia.org/T297468) [18:28:37] (03PS1) 10Andrew Bogott: Add simple script to backup cinder volumes according to yaml config [puppet] - 10https://gerrit.wikimedia.org/r/745926 (https://phabricator.wikimedia.org/T294429) [18:31:29] (03CR) 10Cwhite: [C: 03+2] elasticsearch: manage log4j2 msg formatting [puppet] - 10https://gerrit.wikimedia.org/r/745901 (https://phabricator.wikimedia.org/T297468) (owner: 10Cwhite) [18:32:13] jhathaway: I merged your changes with mine :) [18:32:46] thanks [18:35:17] (03CR) 10Jbond: "Sorry missed the last response" [puppet] - 10https://gerrit.wikimedia.org/r/674077 (owner: 10Majavah) [18:37:30] (03CR) 10Jbond: "sorry i missed this can you resubmit and ping me on monday to merge (or just +2 and merge)" [puppet] - 10https://gerrit.wikimedia.org/r/743380 (owner: 10David Caro) [18:37:47] (03PS2) 10Jbond: pcc: add possibility to fail fast [puppet] - 10https://gerrit.wikimedia.org/r/743379 (https://phabricator.wikimedia.org/T296984) (owner: 10David Caro) [18:38:03] (03CR) 10Jbond: [V: 03+2 C: 03+2] pcc: add possibility to fail fast [puppet] - 10https://gerrit.wikimedia.org/r/743379 (https://phabricator.wikimedia.org/T296984) (owner: 10David Caro) [18:38:55] (03CR) 10Jbond: "i merged this (although it may not work just yet due to lacking pcc vm upgrades)" [puppet] - 10https://gerrit.wikimedia.org/r/743379 (https://phabricator.wikimedia.org/T296984) (owner: 10David Caro) [18:43:01] (03CR) 10Jbond: [C: 03+1] Deprecate interface-range external [homer/public] - 10https://gerrit.wikimedia.org/r/744782 
(https://phabricator.wikimedia.org/T296935) (owner: 10Ayounsi) [18:45:13] (03PS8) 10Jbond: puppetmaster: add puppet-facts-upload job to puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/745599 [18:45:27] (03PS9) 10Jbond: puppetmaster: add puppet-facts-upload job to puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/745599 [18:49:27] (03PS1) 10Jdlrobson: Fix format of VectorWebABTestEnrollment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745931 (https://phabricator.wikimedia.org/T295972) [18:49:50] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:50:55] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts copernicium.wikimedia.org [18:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:02] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: reimage physical host with new hostname mirror1001 - https://phabricator.wikimedia.org/T297508 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jhathaway@cumin1001 for hosts: `copernicium.wikimedia.org` - copernicium.wikimedia.org (*... 
[18:52:30] (03CR) 10Jdlrobson: [C: 04-1] "Debugging further" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745931 (https://phabricator.wikimedia.org/T295972) (owner: 10Jdlrobson) [18:57:31] (03PS2) 10Jdlrobson: Fix format of VectorWebABTestEnrollment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745931 (https://phabricator.wikimedia.org/T295972) [19:03:58] (03CR) 10Clare Ming: [C: 03+1] Fix format of VectorWebABTestEnrollment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745931 (https://phabricator.wikimedia.org/T295972) (owner: 10Jdlrobson) [19:13:58] (03PS3) 10Clare Ming: Fix format of VectorWebABTestEnrollment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745931 (https://phabricator.wikimedia.org/T295972) (owner: 10Jdlrobson) [19:15:34] (03PS1) 10Jdlrobson: Default commons search experience is MediaSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745935 (https://phabricator.wikimedia.org/T297484) [19:21:06] 10SRE, 10SRE-tools, 10Infrastructure-Foundations: Decom cookbook should only warn about unexpected matches in Puppet - https://phabricator.wikimedia.org/T297516 (10RLazarus) p:05Triage→03Medium [19:21:31] (03PS1) 10Jdlrobson: MinervaDonateLink is enabled in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745940 (https://phabricator.wikimedia.org/T191743) [19:28:28] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [19:33:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [19:44:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability (FY2021/2022-Q2): (Need By: TBD) rack/setup/install prometheus100[56] - https://phabricator.wikimedia.org/T294967 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson prometheus1005 A6 U38 Cableid# 3287 Port33 prometheus1006 B6 U39 Cableid# 3967 Po... [19:44:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability (FY2021/2022-Q2): (Need By: TBD) rack/setup/install prometheus100[56] - https://phabricator.wikimedia.org/T294967 (10Jclark-ctr) [19:58:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10Jclark-ctr) @RKemper rack A6 is not 10g rack b4 has no space. Are there any other requirements before i rack in same rows 10g racks? [19:58:45] (03PS1) 10Andrew Bogott: Prepare cloudmetrics100[3,4] to replace cloudmetrics100[1,2] [puppet] - 10https://gerrit.wikimedia.org/r/745948 (https://phabricator.wikimedia.org/T289888) [19:58:47] (03PS1) 10Andrew Bogott: cloudmetrics: replace cloudmetrics1002 with 1003 as the backup host [puppet] - 10https://gerrit.wikimedia.org/r/745949 (https://phabricator.wikimedia.org/T289888) [19:58:49] (03PS1) 10Andrew Bogott: cloudmetrics: make cloudmetrics1003 the primary, 1004 the secondary [puppet] - 10https://gerrit.wikimedia.org/r/745950 (https://phabricator.wikimedia.org/T289888) [19:58:51] (03PS1) 10Andrew Bogott: cloudmetrics: remove refs for cloudmetrics1001/1002 and prepare for decom [puppet] - 10https://gerrit.wikimedia.org/r/745951 (https://phabricator.wikimedia.org/T289888) [20:01:07] (03PS1) 10Majavah: hadmin Add wmcs-roots/labtest-roots to cloudgw nodes [puppet] - 10https://gerrit.wikimedia.org/r/745952 [20:01:27] (03PS2) 10Majavah: admin: Add wmcs-roots/labtest-roots to cloudgw nodes [puppet] - 
10https://gerrit.wikimedia.org/r/745952 [20:01:58] Out of memory errors happening on wtp1025: https://phabricator.wikimedia.org/T297517 [20:05:00] (03PS1) 10Majavah: admin: Add wmcs-roots/labtest-roots to cloudbackup nodes [puppet] - 10https://gerrit.wikimedia.org/r/745953 [20:06:37] dancy: hm, that's not great [20:07:45] (Juniper alarm active) firing: Juniper alarm active - https://alerts.wikimedia.org [20:07:57] (03CR) 10Andrew Bogott: [C: 03+2] admin: Add wmcs-roots/labtest-roots to cloudbackup nodes [puppet] - 10https://gerrit.wikimedia.org/r/745953 (owner: 10Majavah) [20:08:11] (03PS3) 10Jdlrobson: Clean up readers web team config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743051 [20:09:12] dancy: fwiw it looks like all wtp* hosts are in about the same state, wtp1025 is just the leading indicator https://grafana.wikimedia.org/goto/pHHLyLhnz [20:09:24] ok. I'll update the task description. [20:10:18] (03CR) 10Andrew Bogott: "Since this is network infra I defer to Arturo about whether this counts as a cloud-roots host or not :)" [puppet] - 10https://gerrit.wikimedia.org/r/745952 (owner: 10Majavah) [20:12:16] dancy: T296098 also affected parsoid memory a few weeks back, not sure if it's closely related yet -- akosiaris ended up doing a rolling restart in that case, which I was thinking of doing again here, especially as we head into the weekend [20:12:17] T296098: 1.38.0-wmf.9 seems to have introduced a memory leak - https://phabricator.wikimedia.org/T296098 [20:12:56] Worth a shot. [20:13:46] might take some pressure off at least [20:14:10] actually, here's that same graph expanded to a week, you can see some inflection points in the last few days https://grafana.wikimedia.org/goto/yCsqsY27z [20:14:58] there's no easy way to overlay deployments in that particular view, do you know offhand if they correlate with the train? e.g. 
that little elbow at about 2021-12-10 08:00 [20:15:35] unlikely we'd have rolled anything right then, I guess [20:17:14] yeah, no train movements today, but there were backports. [20:18:21] but not at that time. [20:18:47] hmm, okay [20:19:53] out of curiousity, what is using most mem on wtp1025 at the moment? [20:20:06] (What process) [20:20:34] reasonably uniform spread across all the php-fpm workers [20:21:44] note that wtp1025 is specifically marked as a canary server [20:21:48] about what you'd expect, I guess -- a single outlier worker would be interesting but uniform growth is more likely, especially since all the hosts are affected [20:21:50] unlike most other wtp* [20:21:54] oh! thank you [20:22:20] modules/profile/templates/cumin/aliases.yaml.erb:parsoid-canary: P{wtp1025.eqiad.wmnet or wtp1026.eqiad.wmnet or parse2001.codfw.wmnet or parse2002.codfw.wmnet} [20:22:22] that might explain why it maxed out earliest, whatever's leaking started leaking there first [20:22:31] ^ 4 of them [20:22:51] what does wtp stand for? [20:22:55] 1026 memory is also high, so that tracks [20:23:17] dancy: "wikitext parser" -- it's the older name for the parsoid hosts, equivalent to parse* in codfw [20:23:20] this is also confirmed in actual: ~/puppet/conftool-data$ grep -r parsoid * | grep canary btw [20:23:35] dancy: wiki text parser [20:23:38] parsoid [20:23:44] thx [20:24:00] but that's the old name. the new name is "parse*" [20:24:20] dancy: from the deployment calendar it looks like "UTC late backport and config training" was actually at that time, can you check me? 
[20:24:21] eqiad needs wtp -> parse at some point to catch up [20:25:09] it's sorted under the Thursday heading for PST reasons I guess but it was Friday 08:00 UTC if I read this correctly [20:25:33] oh wait no that's 0:00 UTC, I was double-converting [20:25:48] bleh :) sorry, it was such an attractive theory too [20:27:03] https://sal.toolforge.org/production?p=1 only shows db maintainance at that time [20:27:27] yeah, shouldn't be a factor [20:27:58] mutante: any opinion on the rolling restart? [20:29:30] would start with one machine first, probably 1025, verify it goes as expected, and then do the others 10% at a time or so [20:29:30] Seems like whatever is to blame happened around 11/19 - 11/23. [20:30:02] 10ops-eqiad, 10Infrastructure-Foundations, 10netbox: eqad: asw2-c7-eqiad PEM1 not powered - https://phabricator.wikimedia.org/T297518 (10Papaul) [20:30:10] but looking back over 90 days, it seems like steady growth until (presumably) restart is the normal pattern. [20:30:18] It's just happening more rapidly now. [20:30:41] 10ops-eqiad, 10Infrastructure-Foundations, 10netbox: eqad: asw2-c7-eqiad PEM1 not powered - https://phabricator.wikimedia.org/T297518 (10Papaul) p:05Triage→03Medium [20:30:45] rzl: +1 to try it at least on that one machine, wtp1025 [20:30:47] btw, what is this graph showing me, rzl ? [20:31:11] dancy: oh sorry - that's memory in use, in bytes, per host [20:31:22] I am not sure I already have the complete info. is one server misbehaving or are they all under higher load [20:31:34] which volume of `free` does it correspond to? [20:31:38] *column [20:31:53] dancy: on https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1, under "memory per host," it's all the "used" lines stacked up on each other [20:31:54] I guess none of them. hehe [20:32:14] (interesting, https://grafana.wikimedia.org/goto/yCsqsY27z requires login) [20:32:34] rzl: saw more of the backlog. 
yes, that sounds like a good plan to me [20:32:38] AntiComposite: oh sorry, I shorted the logged-in link! one sec [20:32:41] *shortened [20:32:47] alright.. anonymous memory used.. cache not included. Got it. [20:33:19] AntiComposite: try this one, thanks for saying https://w.wiki/4Xs6 [20:33:29] nope :) [20:33:36] huh. [20:34:03] okay, I guess that's just a logged-in-only feature :| [20:34:24] grafana explore isn't available to non logged in users [20:34:25] well, here's the graph I'm looking at: https://usercontent.irccloud-cdn.com/file/jzKFYhON/image.png [20:34:35] rzl: +1 on the rolling restart as a bandaid [20:35:01] yea, doesn't look like the canaries behave differently, they all moved up together [20:35:04] btw, appserver cluster is going to suffer the same fate soon [20:35:18] see https://w.wiki/4Xs8 [20:35:26] 10ops-eqiad, 10Infrastructure-Foundations, 10netbox: eqad: asw2-c7-eqiad PEM1 not powered - https://phabricator.wikimedia.org/T297518 (10Papaul) [20:35:34] akosiaris: oh rad, didn't think you'd be online this late :) and yeah that's part of why I'm eyeballing this so closely [20:35:41] (thanks dancy for the early warning btw, this probably spared us an outage) [20:36:30] akosiaris: https://wikitech.wikimedia.org/wiki/Service_restarts#Parsoid says I can just bounce it with servicectl, do I really not need to depool first? [20:36:38] api cluster is ok, but whatever happened yesterday seems to have introduced something that leaks on appserver and parsoid [20:36:42] er, *systemctl obviously [20:36:52] I think it's something in the wmf.12 train. 
[20:37:03] oh wait that's got to be parsoid-js [20:37:05] rzl: restart-php7.2-fpm [20:37:08] looks like the ramp up started on 12/7 [20:37:10] yeah that's what I just came around to [20:37:23] put it in cumin with a -b2 or something similar and you are good to go [20:37:25] yes, docs are about the js service there [20:37:27] 👍 [20:37:36] I'll update that page after this [20:37:40] I am more worried about the appserver cluster tbh [20:38:09] nod.. well on the way to maxing out [20:38:11] !log rzl@wtp1025:~$ sudo restart-php7.2-fpm - T297517 - rolling restart to follow [20:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:17] T297517: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 [20:38:31] let's eyeball 1025 for a sec [20:40:28] AntiComposite: The link that akosiaris posted is accessible w/o logging in [20:41:26] 1025 is back and climbing again, we'll have to see if this is enough to get us through a full weekend -- we may still need to identify and fix this today [20:41:42] but I'm inclined to go ahead with the rolling restart, any objections? 
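(Editorial sketch, not part of the channel log.) The plan discussed above is to restart one host first, watch it, then do the rest in small batches with a pause in between. A minimal Python sketch of that batching logic follows. The function names and the host names in the usage note are hypothetical, and this only plans the batches; it does not reproduce how cumin schedules them internally.

```python
from typing import List


def ten_percent_batch(host_count: int) -> int:
    """Roughly 10% of the fleet, rounded down, but never fewer than one host."""
    return max(1, host_count // 10)


def plan_batches(hosts: List[str], canary: str, batch_size: int) -> List[List[str]]:
    """Return restart batches: the canary alone first, then the rest in fixed-size groups.

    Hypothetical helper for illustration; the order of `hosts` is preserved.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    rest = [h for h in hosts if h != canary]
    # The canary goes first on its own so it can be checked before continuing.
    batches = [[canary]] if canary in hosts else []
    for i in range(0, len(rest), batch_size):
        batches.append(rest[i:i + batch_size])
    return batches
```

For example, `plan_batches(["h1", "h2", "h3", "h4", "h5"], "h1", 2)` yields `[["h1"], ["h2", "h3"], ["h4", "h5"]]`. An operator would check the first batch before continuing, as rzl proposes for wtp1025.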
[20:41:53] rzl: none from me [20:41:56] No objections [20:42:25] updated docs https://wikitech.wikimedia.org/w/index.php?title=Service_restarts&type=revision&diff=1936357&oldid=1936072 [20:43:20] !log sudo cumin -b2 -s10 -p0 'A:parsoid and not P{wtp1025.eqiad.wmnet}' restart-php7.2-fpm - T297517 [20:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:26] T297517: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 [20:43:27] akosiaris: In https://grafana.wikimedia.org/d/000000607/cluster-overview?viewPanel=86&orgId=1&from=now-7d&to=now&var-site=eqiad&var-cluster=appserver&var-instance=All&var-datasource=thanos Are the dips in memory use around deployments due to the php-fpm restart logic? [20:45:07] mutante: perfect, thank you [20:46:14] dancy: should be, yeah -- do I remember right that we restart php-fpm these days as part of the deploy? [20:46:32] yeah, it happens after every scap sync-* operation. [20:46:38] that'll do it then, yep [20:47:11] in a perfect world with no memory leaks, that wouldn't have much effect on steady-state memory usage -- in practice, it means you always clean up a little bit [20:48:09] (I don't think we hit this much with MW, but in the industry, services that normally deploy every day often run into memory-leak issues over the holidays, because a deployment freeze means they don't restart as often as they're used to) [20:48:28] nod [20:49:50] rolling restart complete [20:50:07] LB warnings from all the codfw hosts, which I *think* is normal but double-checking [20:50:15] e.g. 
2021-12-10 20:43:20,095 [WARNING] LB lvs2009:9090 reports pool parsoid-php_443/parse2002.codfw.wmnet as disabled/up/not pooled, should be enabled/up/pooled [20:51:11] full output https://phabricator.wikimedia.org/P18114 [20:51:53] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:52:31] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netbox: eqad: asw2-c7-eqiad PEM1 not powered - https://phabricator.wikimedia.org/T297518 (10Papaul) 05Open→03Resolved a:03Papaul John went back to check, it was a loose power cable. All good now. resolving this . @Jclark-ctr thanks [20:52:45] (Juniper alarm active) resolved: Juniper alarm active - https://alerts.wikimedia.org [21:00:26] okay, do we want a rolling restart on A:mw-eqiad as well? is it better if I depool one host and leave it for investigation? [21:03:58] rzl: we can NOT do it on mwdebug1001 and leave that as is [21:05:00] mutante: I don't think mwdebug1001 gets enough traffic to be affected [21:05:22] from the graphs its memory utilization is trending down [21:05:35] I might just skip one of the canaries instead [21:05:55] I wanted to say that and opened tab to get list of canaries, yes [21:06:02] cool [21:06:37] node/eqiad.yaml: mw1414.eqiad.wmnet: [apache2,nginx,canary] [21:06:40] this one [21:07:34] like we did some other time, lowest number that is canary and not debug [21:08:03] sure [21:09:09] keeping batches of 10%, so -b7 out of 72 [21:09:41] !log rzl@mw1414:~$ sudo depool - preserving for investigation, T297517 [21:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:47] T297517: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 [21:10:40] !log sudo cumin -b7 -s10 -p0 'A:mw-eqiad and not P{mw1414.eqiad.wmnet}' restart-php7.2-fpm [21:10:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:53] oops, task number [21:10:57] I'll add it to phab manually [21:14:46] appserver rolling restart complete [21:15:46] sees global memory usage go down in cluster overview [21:15:50] eqiad [21:16:24] appservers did serve some 503s during that process, I guess for requests that were inflight despite the five-second pause after depooling [21:16:30] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: CAS should link to account creation tutorial - https://phabricator.wikimedia.org/T297524 (10Majavah) [21:18:30] https://phabricator.wikimedia.org/P18114#92399 [21:18:57] (03PS3) 10Ebernhardson: sre.wdqs: Integrate wcqs with wdqs cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/745629 (https://phabricator.wikimedia.org/T293638) [21:18:58] did not even try to deep link, just a screenshot for the record [21:18:59] (03CR) 10Ebernhardson: sre.wdqs: Integrate wcqs with wdqs cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/745629 (https://phabricator.wikimedia.org/T293638) (owner: 10Ebernhardson) [21:19:11] smart [21:27:14] I'm afk just a sec to heat up some food but nominally still looking at this [21:27:42] https://w.wiki/4Xsx [21:28:12] the slope changes at approx 2021-12-09 20:00 [21:29:00] https://sal.toolforge.org/log/QzbLoH0Ba_6PSCT98KKU [21:31:22] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: CAS should link to account creation tutorial - https://phabricator.wikimedia.org/T297524 (10brennen) +1 to this. We should probably also make the account creation pathway more obvious from the GitLab instance itself. [21:31:33] dancy: ^ wdyt? [21:33:09] memory usage eqiad down from 3.19TiB to 879GiB. so far not growing again. 
but maybe we should revert anyways [21:33:21] https://i.imgur.com/tzkkctt.png if you aren't / can't log in to grafana [21:33:28] there were also security patches deployed immediately before the train on 12-09 [21:59:11] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:59:20] * brennen gets caught up on scrollback [21:59:51] brennen: 👋 [22:04:45] eyeballing the graphs, post-restart I would expect the appserver cluster to start running out of memory again in 30-40 hours, so probably sometime Sunday [22:04:49] dancy may be afk at the moment; i'm around for the next couple hours if a rollback over the weekend seems advisable. [22:05:05] i defer to everyone else's judgment, but that kinda sounds like it's advisable. [22:05:14] o/ [22:05:15] (assuming wmf.12 is at fault here.) [22:05:32] (as seems pretty likely) [22:05:44] yeah -- AIUI, I think we don't know for sure whether it's the train or one of the security patches, but if we think it's the train I'd argue for rolling it back [22:05:51] better a Friday than a Sunday [22:06:11] I would agree with that. at least it's 2pm now and not 5 [22:06:16] Agreed [22:06:23] if it's one of the security patches, obviously we don't rollback but hopefully we could fix forward, and I'd argue for doing that today too if possible [22:06:43] but if it's not growing again now.. then how do we know if rollback fixed it [22:06:43] Starting rollback of all wikis to wmf.9 (with sec patch). 
[22:07:19] mutante: it's growing again, the restart just bought us time [22:07:25] growing still, I should say :) [22:07:34] (03PS1) 10Ahmon Dancy: all wikis to 1.38.0-wmf.9 refs T293953 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745955 [22:07:35] ok, ACK ..then..yea [22:07:36] (03CR) 10Ahmon Dancy: [C: 03+2] all wikis to 1.38.0-wmf.9 refs T293953 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745955 (owner: 10Ahmon Dancy) [22:07:40] glad we are reverting [22:08:10] dancy: thanks -- and sorry, I know that makes a couple of weeks without a successful train :/ [22:08:13] (03Merged) 10jenkins-bot: all wikis to 1.38.0-wmf.9 refs T293953 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745955 (owner: 10Ahmon Dancy) [22:09:37] PROBLEM - SSH on kubernetes1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:09:48] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.9 refs T293953 [22:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:54] T293953: 1.38.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T293953 [22:10:03] okaaaay.. let's see what happens. [22:10:26] thanks [22:10:36] thinking ahead just in case: if that doesn't clear it up, it means one of the security patches is causing the memory leak [22:11:01] we should probably be able to tell in relatively short order, right? [22:11:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:18] brennen: yep, I'm just trying to figure out what to do next in that case [22:11:23] I am praying to the security gods in the meantime [22:11:24] hey, can I help? 
(with the sec patches or something) [22:11:35] my PHP isn't good enough to try to track down the bug, and l.egoktm is out sick for the day -- aha perfect :) [22:11:49] majavah: :) I think we will know that in a few minutes! but glad you are here [22:12:06] majavah: depends if it goes away after that revert above or not [22:12:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:17] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:58:29] (03PS1) 10RLazarus: trafficserver: Temporarily disable mwdebug on kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/745963 (https://phabricator.wikimedia.org/T297322) [23:00:12] (03CR) 10jerkins-bot: [V: 04-1] trafficserver: Temporarily disable mwdebug on kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/745963 (https://phabricator.wikimedia.org/T297322) (owner: 10RLazarus) [23:00:21] o/ [23:00:32] do you want me to take a look? 
[23:06:41] (03PS2) 10RLazarus: trafficserver: Temporarily disable mwdebug on kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/745963 (https://phabricator.wikimedia.org/T297322) [23:12:12] (03CR) 10Ladsgroup: [C: 03+1] trafficserver: Temporarily disable mwdebug on kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/745963 (https://phabricator.wikimedia.org/T297322) (owner: 10RLazarus) [23:12:58] (03CR) 10Giuseppe Lavagetto: [C: 03+1] trafficserver: Temporarily disable mwdebug on kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/745963 (https://phabricator.wikimedia.org/T297322) (owner: 10RLazarus) [23:15:33] (03CR) 10RLazarus: [C: 03+2] trafficserver: Temporarily disable mwdebug on kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/745963 (https://phabricator.wikimedia.org/T297322) (owner: 10RLazarus) [23:18:23] afk [23:28:16] (03PS1) 10Ladsgroup: Revert "trafficserver: Temporarily disable mwdebug on kubernetes" [puppet] - 10https://gerrit.wikimedia.org/r/745873