[00:04:22] (CR) Dzahn: [C: +2] scap.cfg.erb: Set require_security_patches: False in beta [puppet] - https://gerrit.wikimedia.org/r/1007441 (https://phabricator.wikimedia.org/T350070) (owner: Ahmon Dancy)
[00:07:39] (CR) Dzahn: [C: +2] logstash_checker.py: Handle missing mediawiki_deployments_file [puppet] - https://gerrit.wikimedia.org/r/1007449 (https://phabricator.wikimedia.org/T357402) (owner: Ahmon Dancy)
[00:10:04] (CR) Dzahn: "puppet compiler fails on this with just "SyntaxError: JSON.parse: unexpected end of data at line 1 column 1 of the JSON data" is there an " [puppet] - https://gerrit.wikimedia.org/r/1007434 (https://phabricator.wikimedia.org/T358237) (owner: Dzahn)
[00:31:28] SRE, SRE-Access-Requests: Updating access key - rkhan - https://phabricator.wikimedia.org/T357483#9585882 (Himejijo) Hi, I still don't have access to mwmaint2002.codfw.wmnet - I'm checking and deleting email addresses from Wikipedia user accounts as a part of Privacy@ so I'll need that access. Thank you!
[00:35:29] SRE, SRE-Access-Requests: Updating access key - rkhan - https://phabricator.wikimedia.org/T357483#9585893 (Himejijo) When I ssh into mwmaint this is the error I get: ❯ ssh mwmaint2002.codfw.wmnet Enter passphrase for key '/Users/riddy/.ssh/id_ed25519': rkhan@mwmaint2002.codfw.wmnet: Permission denied (p...
[00:39:13] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1006879
[00:39:15] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1006879 (owner: TrainBranchBot)
[00:48:30] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:59:47] !log ayounsi@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host testvm2006.codfw.wmnet with OS bookworm
[00:59:47] !log ayounsi@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=93) for new host testvm2006.codfw.wmnet
[01:01:47] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1006879 (owner: TrainBranchBot)
[01:29:32] SRE, LDAP-Access-Requests: Grant Access to nda, wmde for Frederik Ring - https://phabricator.wikimedia.org/T358584#9585939 (KFrancis) Hi all, the NDA has been signed. Thanks!
[01:49:51] !log ruwiktionary `UPDATE page SET page_namespace=1,page_title=CONCAT('Broken/NS2301:',page_title) WHERE page_id=2469240 AND page_namespace=2301` T31272
[01:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:49:58] T31272: Implement Gadgets 2.0 - https://phabricator.wikimedia.org/T31272
[01:50:43] !log ruwiktionary `UPDATE page SET page_namespace=1,page_title=CONCAT('Broken/NS2303:',page_title) WHERE page_id=2469241 AND page_namespace=2303; ` T31272
[01:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:17:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T352010)', diff saved to https://phabricator.wikimedia.org/P58161 and previous config saved to /var/cache/conftool/dbconfig/20240229-021728-ladsgroup.json
[02:17:35] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[02:25:38] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!"
[puppet] - https://gerrit.wikimedia.org/r/1007332 (https://phabricator.wikimedia.org/T356788) (owner: Filippo Giunchedi)
[02:32:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P58162 and previous config saved to /var/cache/conftool/dbconfig/20240229-023234-ladsgroup.json
[02:38:02] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:47:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P58163 and previous config saved to /var/cache/conftool/dbconfig/20240229-024741-ladsgroup.json
[02:58:02] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:02:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T352010)', diff saved to https://phabricator.wikimedia.org/P58164 and previous config saved to /var/cache/conftool/dbconfig/20240229-030247-ladsgroup.json
[03:02:50] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance
[03:02:55] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[03:03:03] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance
[03:03:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1224 (T352010)', diff saved to https://phabricator.wikimedia.org/P58165 and previous config saved
to /var/cache/conftool/dbconfig/20240229-030309-ladsgroup.json
[03:23:56] (PS1) Andrea Denisse: icinga: Remove stale nagios_group entries to fix [puppet] - https://gerrit.wikimedia.org/r/1007467 (https://phabricator.wikimedia.org/T358540)
[03:25:02] (PS2) Andrea Denisse: icinga: Remove stale nagios_group entries [puppet] - https://gerrit.wikimedia.org/r/1007467 (https://phabricator.wikimedia.org/T358540)
[03:26:45] (PS4) KartikMistry: Section Translation: Add 'nb' in target language code [mediawiki-config] - https://gerrit.wikimedia.org/r/1007340 (https://phabricator.wikimedia.org/T353734)
[03:27:38] (PS2) KartikMistry: Enable Section translation on Wikipedias with Content Translation available as default [mediawiki-config] - https://gerrit.wikimedia.org/r/1007286 (https://phabricator.wikimedia.org/T351882)
[03:28:39] (CirrusSearchNodeIndexingNotIncreasing) firing: Elasticsearch instance elastic2043-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[03:38:08] (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1526/console" [puppet] - https://gerrit.wikimedia.org/r/1007467 (https://phabricator.wikimedia.org/T358540) (owner: Andrea Denisse)
[03:40:09] (CR) Andrea Denisse: "The PCC results show that the correct group is assigned: https://puppet-compiler.wmflabs.org/output/1007467/1526/" [puppet] - https://gerrit.wikimedia.org/r/1007467 (https://phabricator.wikimedia.org/T358540) (owner: Andrea Denisse)
[03:50:36] (PS1) Andrea Denisse: icinga: Set log group to 'adm' for consistency with other tools [puppet] - https://gerrit.wikimedia.org/r/1007470
(https://phabricator.wikimedia.org/T358540)
[03:53:15] SRE-swift-storage, Commons, UploadWizard, 1.42.0-wmf.20; 2024-02-27: Incomplete files uploaded - chunked upload drops last chunk. - https://phabricator.wikimedia.org/T350917#9586045 (Pppery)
[03:56:06] (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1527/co" [puppet] - https://gerrit.wikimedia.org/r/1007470 (https://phabricator.wikimedia.org/T358540) (owner: Andrea Denisse)
[03:57:37] (PS2) Tim Starling: beta: Switch block schema to read-new/write-new mode [mediawiki-config] - https://gerrit.wikimedia.org/r/998626 (https://phabricator.wikimedia.org/T355034)
[03:57:52] (PS2) Andrea Denisse: icinga: Set log group to 'adm' for consistency with other tools [puppet] - https://gerrit.wikimedia.org/r/1007470 (https://phabricator.wikimedia.org/T358539)
[04:00:00] (CR) Tim Starling: [C: +2] beta: Switch block schema to read-new/write-new mode [mediawiki-config] - https://gerrit.wikimedia.org/r/998626 (https://phabricator.wikimedia.org/T355034) (owner: Tim Starling)
[04:00:02] (CR) Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/1007467/1526/" [puppet] - https://gerrit.wikimedia.org/r/1007470 (https://phabricator.wikimedia.org/T358539) (owner: Andrea Denisse)
[04:00:44] (Merged) jenkins-bot: beta: Switch block schema to read-new/write-new mode [mediawiki-config] - https://gerrit.wikimedia.org/r/998626 (https://phabricator.wikimedia.org/T355034) (owner: Tim Starling)
[04:04:01] (CR) Tim Starling: [C: +2] Switch block schema to read-old/write-both mode [mediawiki-config] - https://gerrit.wikimedia.org/r/1006179 (https://phabricator.wikimedia.org/T355034) (owner: Tim Starling)
[04:04:45] (Merged) jenkins-bot: Switch block schema to read-old/write-both mode [mediawiki-config] - https://gerrit.wikimedia.org/r/1006179
(https://phabricator.wikimedia.org/T355034) (owner: Tim Starling)
[04:19:17] !log tstarling@deploy2002 Synchronized wmf-config/CommonSettings.php: Switch block schema to read-old/write-both mode T355034 (duration: 08m 47s)
[04:19:24] T355034: Deploy new block_target schema - https://phabricator.wikimedia.org/T355034
[04:30:35] !log on mwmaint2002 running migrateBlocks.php on all wikis
[04:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:48:30] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:22:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T352010)', diff saved to https://phabricator.wikimedia.org/P58167 and previous config saved to /var/cache/conftool/dbconfig/20240229-052202-ladsgroup.json
[05:22:09] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[05:37:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P58168 and previous config saved to /var/cache/conftool/dbconfig/20240229-053708-ladsgroup.json
[05:52:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P58169 and previous config saved to /var/cache/conftool/dbconfig/20240229-055215-ladsgroup.json
[06:07:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T352010)', diff saved to https://phabricator.wikimedia.org/P58170 and previous config saved to /var/cache/conftool/dbconfig/20240229-060721-ladsgroup.json
[06:07:24] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[06:07:29] T352010:
Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[06:07:37] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[06:19:34] (PS1) Marostegui: db2218: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1007479
[06:22:02] (CR) Marostegui: [C: +2] db2218: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1007479 (owner: Marostegui)
[06:26:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Pool db2218 with 1% weight only T358421 T355422', diff saved to https://phabricator.wikimedia.org/P58171 and previous config saved to /var/cache/conftool/dbconfig/20240229-062601-marostegui.json
[06:26:12] T358421: db2118 crashed and rebooted due to HW - https://phabricator.wikimedia.org/T358421
[06:26:12] T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422
[06:27:05] (PS1) Marostegui: instances.yaml: Remove db2118 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1007481 (https://phabricator.wikimedia.org/T358421)
[06:28:33] (CR) Marostegui: [C: +2] instances.yaml: Remove db2118 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1007481 (https://phabricator.wikimedia.org/T358421) (owner: Marostegui)
[06:30:42] SRE, DBA, Wikimedia-Incident: db2118 crashed and rebooted due to HW - https://phabricator.wikimedia.org/T358421#9586106 (Marostegui) Open→Resolved a:Marostegui This host will no longer come back to production. I will decommission it in a couple of days.
[06:34:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2118 from dbctl', diff saved to https://phabricator.wikimedia.org/P58172 and previous config saved to /var/cache/conftool/dbconfig/20240229-063412-marostegui.json
[06:34:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 1%: After optimizing revision table', diff saved to https://phabricator.wikimedia.org/P58173 and previous config saved to /var/cache/conftool/dbconfig/20240229-063420-root.json
[06:35:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1034 T358180', diff saved to https://phabricator.wikimedia.org/P58174 and previous config saved to /var/cache/conftool/dbconfig/20240229-063502-root.json
[06:35:08] T358180: Upgrade es3 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358180
[06:35:49] (PS1) Marostegui: es1034: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1007483 (https://phabricator.wikimedia.org/T358180)
[06:35:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on es1034.eqiad.wmnet with reason: Reimage
[06:36:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on es1034.eqiad.wmnet with reason: Reimage
[06:37:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1034.eqiad.wmnet with OS bookworm
[06:37:16] (CR) Marostegui: [C: +2] es1034: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1007483 (https://phabricator.wikimedia.org/T358180) (owner: Marostegui)
[06:38:27] (PS1) Marostegui: db2118: No longer candidate master [puppet] - https://gerrit.wikimedia.org/r/1007484 (https://phabricator.wikimedia.org/T358740)
[06:39:50] (CR) Marostegui: [C: +2] db2118: No longer candidate master [puppet] - https://gerrit.wikimedia.org/r/1007484 (https://phabricator.wikimedia.org/T358740) (owner: Marostegui)
[06:44:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Pool db2218 with 5% weight
only', diff saved to https://phabricator.wikimedia.org/P58175 and previous config saved to /var/cache/conftool/dbconfig/20240229-064402-marostegui.json
[06:49:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 5%: After optimizing revision table', diff saved to https://phabricator.wikimedia.org/P58176 and previous config saved to /var/cache/conftool/dbconfig/20240229-064925-root.json
[06:50:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1034.eqiad.wmnet with reason: host reimage
[06:52:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1034.eqiad.wmnet with reason: host reimage
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240229T0700)
[07:00:05] kormat, marostegui, Amir1, and arnaudb: Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240229T0700). Please do the needful.
[07:04:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: After optimizing revision table', diff saved to https://phabricator.wikimedia.org/P58177 and previous config saved to /var/cache/conftool/dbconfig/20240229-070430-root.json
[07:14:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1034.eqiad.wmnet with OS bookworm
[07:14:41] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2003.codfw.wmnet with reason: sretest
[07:15:06] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2003.codfw.wmnet with reason: sretest
[07:19:19] !log slyngshede@cumin1002 START - Cookbook sre.hosts.reimage for host idp-test1003.wikimedia.org with OS bookworm
[07:19:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: After optimizing revision table', diff saved to https://phabricator.wikimedia.org/P58178 and previous config saved to /var/cache/conftool/dbconfig/20240229-071935-root.json
[07:23:02] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:23:41] (PS1) Slyngshede: P:idp switch final prodution host to Java 17.
[puppet] - https://gerrit.wikimedia.org/r/1007487 (https://phabricator.wikimedia.org/T357748)
[07:25:18] (CR) Slyngshede: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1528/console" [puppet] - https://gerrit.wikimedia.org/r/1007487 (https://phabricator.wikimedia.org/T357748) (owner: Slyngshede)
[07:28:40] (CirrusSearchNodeIndexingNotIncreasing) firing: Elasticsearch instance elastic2043-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[07:31:44] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:32:31] !log slyngshede@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on idp-test1003.wikimedia.org with reason: host reimage
[07:32:42] (PS1) Marostegui: Revert "es1034: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1007372
[07:33:59] (CR) Marostegui: [C: +2] Revert "es1034: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1007372 (owner: Marostegui)
[07:34:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: After optimizing revision table', diff saved to https://phabricator.wikimedia.org/P58179 and previous config saved to /var/cache/conftool/dbconfig/20240229-073440-root.json
[07:34:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1034 (re)pooling @ 5%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58180 and previous config saved to /var/cache/conftool/dbconfig/20240229-073457-root.json
[07:35:22] !log
slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idp-test1003.wikimedia.org with reason: host reimage
[07:35:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote back es1034 to es3 eqiad master T358180', diff saved to https://phabricator.wikimedia.org/P58181 and previous config saved to /var/cache/conftool/dbconfig/20240229-073523-marostegui.json
[07:35:32] T358180: Upgrade es3 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358180
[07:48:02] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:49:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: After optimizing revision table', diff saved to https://phabricator.wikimedia.org/P58182 and previous config saved to /var/cache/conftool/dbconfig/20240229-074944-root.json
[07:50:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1034 (re)pooling @ 10%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58183 and previous config saved to /var/cache/conftool/dbconfig/20240229-075002-root.json
[07:51:48] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idp-test1003.wikimedia.org with OS bookworm
[07:59:16] (PS3) KartikMistry: Enable Section translation on Wikipedias with Content Translation available as default [mediawiki-config] - https://gerrit.wikimedia.org/r/1007286 (https://phabricator.wikimedia.org/T351882)
[08:00:06] Amir1 and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240229T0800).
[08:00:06] kart_: A patch you scheduled for UTC morning backport window is about to be deployed.
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:16] Aye. /me is here.
[08:00:57] You can self serve
[08:01:00] Have fun
[08:01:35] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1007286 (https://phabricator.wikimedia.org/T351882) (owner: KartikMistry)
[08:01:44] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[08:02:18] (Merged) jenkins-bot: Enable Section translation on Wikipedias with Content Translation available as default [mediawiki-config] - https://gerrit.wikimedia.org/r/1007286 (https://phabricator.wikimedia.org/T351882) (owner: KartikMistry)
[08:03:13] !log kartik@deploy2002 Started scap: Backport for [[gerrit:1007286|Enable Section translation on Wikipedias with Content Translation available as default (T351882)]]
[08:03:28] T351882: Enable Section translation on Wikipedias with Content Translation available as default - https://phabricator.wikimedia.org/T351882
[08:04:47] !log kartik@deploy2002 kartik: Backport for [[gerrit:1007286|Enable Section translation on Wikipedias with Content Translation available as default (T351882)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:04:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: After optimizing revision table', diff saved to https://phabricator.wikimedia.org/P58184 and previous config saved to /var/cache/conftool/dbconfig/20240229-080449-root.json
[08:05:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1034 (re)pooling @ 25%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58185 and previous config saved to
/var/cache/conftool/dbconfig/20240229-080507-root.json
[08:17:10] !log kartik@deploy2002 kartik: Continuing with sync
[08:20:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1034 (re)pooling @ 50%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58186 and previous config saved to /var/cache/conftool/dbconfig/20240229-082012-root.json
[08:23:02] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:23:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 1%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P58187 and previous config saved to /var/cache/conftool/dbconfig/20240229-082356-root.json
[08:25:18] !log kartik@deploy2002 Finished scap: Backport for [[gerrit:1007286|Enable Section translation on Wikipedias with Content Translation available as default (T351882)]] (duration: 22m 04s)
[08:25:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[08:25:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[08:26:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1163 (T354015)', diff saved to https://phabricator.wikimedia.org/P58188 and previous config saved to /var/cache/conftool/dbconfig/20240229-082602-marostegui.json
[08:26:07] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[08:26:09] T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015
[08:35:17] !log
marostegui@cumin1002 dbctl commit (dc=all): 'es1034 (re)pooling @ 75%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58189 and previous config saved to /var/cache/conftool/dbconfig/20240229-083517-root.json
[08:39:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 5%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P58190 and previous config saved to /var/cache/conftool/dbconfig/20240229-083901-root.json
[08:39:08] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[08:39:17] !log kartik@deploy2002 Started scap: Backport for [[gerrit:1007340|Section Translation: Add 'nb' in target language code (T353734)]]
[08:39:22] T353734: Fix the mobile experience for a group of Wikipedias where Content Translation is in beta - https://phabricator.wikimedia.org/T353734
[08:39:22] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[08:39:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T357189)', diff saved to https://phabricator.wikimedia.org/P58191 and previous config saved to /var/cache/conftool/dbconfig/20240229-083928-arnaudb.json
[08:39:34] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[08:40:48] !log kartik@deploy2002 kartik: Backport for [[gerrit:1007340|Section Translation: Add 'nb' in target language code (T353734)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:44:21] !log kartik@deploy2002 kartik: Continuing with sync
[08:44:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T357189)', diff saved to https://phabricator.wikimedia.org/P58192 and previous config saved to /var/cache/conftool/dbconfig/20240229-084444-arnaudb.json
[08:44:50] T357189: Drop iwl_prefix_from_title from iwlinks -
https://phabricator.wikimedia.org/T357189
[08:45:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote back es2029 to es3 codfw master T358180', diff saved to https://phabricator.wikimedia.org/P58193 and previous config saved to /var/cache/conftool/dbconfig/20240229-084502-marostegui.json
[08:45:09] T358180: Upgrade es3 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358180
[08:45:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2034 T358180', diff saved to https://phabricator.wikimedia.org/P58194 and previous config saved to /var/cache/conftool/dbconfig/20240229-084541-root.json
[08:47:59] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es2034.codfw.wmnet with OS bookworm
[08:48:02] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:50:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1034 (re)pooling @ 100%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58195 and previous config saved to /var/cache/conftool/dbconfig/20240229-085021-root.json
[08:52:02] !log kartik@deploy2002 Finished scap: Backport for [[gerrit:1007340|Section Translation: Add 'nb' in target language code (T353734)]] (duration: 12m 45s)
[08:52:08] T353734: Fix the mobile experience for a group of Wikipedias where Content Translation is in beta - https://phabricator.wikimedia.org/T353734
[08:54:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 10%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P58196 and previous config saved to /var/cache/conftool/dbconfig/20240229-085406-root.json
[08:56:53] * kart_ is done with config deployment.
[08:59:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P58197 and previous config saved to /var/cache/conftool/dbconfig/20240229-085951-arnaudb.json
[09:05:29] (PS1) Marostegui: Revert "es2034: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1007373
[09:05:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2034.codfw.wmnet with reason: host reimage
[09:07:49] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.clone of db2156.codfw.wmnet onto db2190.codfw.wmnet
[09:08:30] (PS1) Klausman: ML isvcs: change metric used for calculating memory usage [alerts] - https://gerrit.wikimedia.org/r/1007564 (https://phabricator.wikimedia.org/T358742)
[09:08:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2034.codfw.wmnet with reason: host reimage
[09:09:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 25%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P58198 and previous config saved to /var/cache/conftool/dbconfig/20240229-090911-root.json
[09:09:21] (PS2) Klausman: ML isvcs: change metric used for calculating memory usage [alerts] - https://gerrit.wikimedia.org/r/1007564 (https://phabricator.wikimedia.org/T358742)
[09:11:06] (CR) Kevin Bazira: [C: +1] ML isvcs: change metric used for calculating memory usage [alerts] - https://gerrit.wikimedia.org/r/1007564 (https://phabricator.wikimedia.org/T358742) (owner: Klausman)
[09:14:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P58199 and previous config saved to /var/cache/conftool/dbconfig/20240229-091457-arnaudb.json
[09:16:09] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes
https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[09:22:58] (PS1) Majavah: P:wmcs::cloudgw: make virt_floating an array [puppet] - https://gerrit.wikimedia.org/r/1007566
[09:23:12] (PS2) Majavah: P:wmcs::cloudgw: make virt_floating an array [puppet] - https://gerrit.wikimedia.org/r/1007566
[09:23:40] (CR) MVernon: "I'm afraid I don't quite understand the implications of this - changing one "swift" nagios_group into two: "swift_eqiad" and "swift_codfw"" [puppet] - https://gerrit.wikimedia.org/r/1007467 (https://phabricator.wikimedia.org/T358540) (owner: Andrea Denisse)
[09:24:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 50%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P58200 and previous config saved to /var/cache/conftool/dbconfig/20240229-092416-root.json
[09:24:18] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db2144.codfw.wmnet,db1201.eqiad.wmnet with reason: Silence for maintenance T356240
[09:24:31] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db2144.codfw.wmnet,db1201.eqiad.wmnet with reason: Silence for maintenance T356240
[09:24:38] (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1529/co" [puppet] - https://gerrit.wikimedia.org/r/1007566 (owner: Majavah)
[09:26:00] (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1530/co" [puppet] - https://gerrit.wikimedia.org/r/1007566 (owner: Majavah)
[09:26:08] !log brouberol@cumin1002 START - Cookbook sre.hosts.decommission for hosts an-tool1005.eqiad.wmnet
[09:26:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2034.codfw.wmnet with OS bookworm
[09:27:47]
(03PS1) 10Jaime Nuche: jenkins: add security patch bot token to releases instance [puppet] - 10https://gerrit.wikimedia.org/r/1007323 (https://phabricator.wikimedia.org/T350065) [09:27:49] (03PS1) 10Jaime Nuche: jenkins: add security patch bot token to releases instance secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1007319 (https://phabricator.wikimedia.org/T350065) [09:28:52] 06SRE, 102024.03.04 - 2024.03.24: Update maxmind download to pull databases from new url - https://phabricator.wikimedia.org/T358268#9586346 (10Gehel) [09:28:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'depooling for maintenance - reboot', diff saved to https://phabricator.wikimedia.org/P58201 and previous config saved to /var/cache/conftool/dbconfig/20240229-092853-arnaudb.json [09:29:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 5%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58202 and previous config saved to /var/cache/conftool/dbconfig/20240229-092913-root.json [09:29:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote back es2034 to es3 codfw master T358180', diff saved to https://phabricator.wikimedia.org/P58203 and previous config saved to /var/cache/conftool/dbconfig/20240229-092929-marostegui.json [09:29:36] T358180: Upgrade es3 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358180 [09:29:51] 07sre-alert-triage, 102024.02.12 - 2024.03.03: Alert in need of triage: Number of requests triggering circuit breakers due to excessive memory usage (instance graphite1005) - https://phabricator.wikimedia.org/T357614#9586369 (10Gehel) p:05Triage→03High [09:29:56] 07sre-alert-triage, 102024.02.12 - 2024.03.03: Alert in need of triage: Number of requests triggering circuit breakers due to excessive memory usage (instance graphite1005) - https://phabricator.wikimedia.org/T357614#9586371 (10Gehel) [09:30:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T357189)', diff 
saved to https://phabricator.wikimedia.org/P58204 and previous config saved to /var/cache/conftool/dbconfig/20240229-093003-arnaudb.json [09:30:06] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1186.eqiad.wmnet with reason: Maintenance [09:30:09] (03CR) 10Marostegui: [C: 03+2] Revert "es2034: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1007373 (owner: 10Marostegui) [09:30:11] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [09:30:19] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1186.eqiad.wmnet with reason: Maintenance [09:30:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1186 (T357189)', diff saved to https://phabricator.wikimedia.org/P58205 and previous config saved to /var/cache/conftool/dbconfig/20240229-093025-arnaudb.json [09:30:45] (03PS1) 10Ayounsi: provision cookbook add warning for virt hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1007567 (https://phabricator.wikimedia.org/T344342) [09:31:57] (03PS1) 10Slyngshede: Revert "C:tomcat Allow users to specify which version of Tomcat to install." 
[puppet] - 10https://gerrit.wikimedia.org/r/1007374 [09:32:24] (03CR) 10Volans: [C: 03+1] "I don't mind this but I don't think is a great long term solution" [cookbooks] - 10https://gerrit.wikimedia.org/r/1007567 (https://phabricator.wikimedia.org/T344342) (owner: 10Ayounsi) [09:33:08] 06SRE, 10ops-eqiad, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9586379 (10Gehel) p:05Triage→03High [09:33:31] 06SRE, 10ops-eqiad, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9586384 (10Gehel) [09:33:51] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (and let's disable Puppet on the prod IDPs for controlled rollout)" [puppet] - 10https://gerrit.wikimedia.org/r/1007374 (owner: 10Slyngshede) [09:34:00] !log brouberol@cumin1002 START - Cookbook sre.dns.netbox [09:34:26] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1531/console" [puppet] - 10https://gerrit.wikimedia.org/r/1007374 (owner: 10Slyngshede) [09:35:00] (03CR) 10CI reject: [V: 04-1] provision cookbook add warning for virt hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1007567 (https://phabricator.wikimedia.org/T344342) (owner: 10Ayounsi) [09:35:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T357189)', diff saved to https://phabricator.wikimedia.org/P58206 and previous config saved to /var/cache/conftool/dbconfig/20240229-093543-arnaudb.json [09:35:45] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1532/console" [puppet] - 10https://gerrit.wikimedia.org/r/1007374 (owner: 10Slyngshede) [09:35:50] T357189: Drop iwl_prefix_from_title from iwlinks - 
https://phabricator.wikimedia.org/T357189 [09:36:21] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db1201.eqiad.wmnet [09:37:06] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] Revert "C:tomcat Allow users to specify which version of Tomcat to install." [puppet] - 10https://gerrit.wikimedia.org/r/1007374 (owner: 10Slyngshede) [09:38:59] (03CR) 10Muehlenhoff: [C: 03+1] "Thanks! This would have been useful to have multiple times in the past already" [cookbooks] - 10https://gerrit.wikimedia.org/r/1007567 (https://phabricator.wikimedia.org/T344342) (owner: 10Ayounsi) [09:39:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 75%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P58207 and previous config saved to /var/cache/conftool/dbconfig/20240229-093921-root.json [09:41:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1201.eqiad.wmnet [09:42:54] (03PS21) 10Ayounsi: Cookbook to renumber a host while changing its vlan [cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) [09:44:06] (03PS22) 10Ayounsi: Cookbook to renumber a host while changing its vlan [cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) [09:44:09] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1231.eqiad.wmnet with reason: Maintenance [09:44:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 10%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58208 and previous config saved to /var/cache/conftool/dbconfig/20240229-094418-root.json [09:44:23] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1231.eqiad.wmnet with reason: Maintenance [09:44:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1231 (T352010)', diff saved to https://phabricator.wikimedia.org/P58209 and 
previous config saved to /var/cache/conftool/dbconfig/20240229-094429-ladsgroup.json [09:44:35] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [09:44:56] (03PS2) 10Ayounsi: provision cookbook add warning for virt hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1007567 (https://phabricator.wikimedia.org/T344342) [09:47:12] jouncebot: next [09:47:12] In 1 hour(s) and 12 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240229T1100) [09:47:12] In 1 hour(s) and 12 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240229T1100) [09:47:21] (03CR) 10Btullis: [C: 03+1] "Looks good. The CI failure is just a missing keytab in the labs/private repo, which we can fix separately." [puppet] - 10https://gerrit.wikimedia.org/r/1007327 (owner: 10Muehlenhoff) [09:47:23] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: revert to standard logging level [puppet] - 10https://gerrit.wikimedia.org/r/1007332 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi) [09:49:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 25%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58210 and previous config saved to /var/cache/conftool/dbconfig/20240229-094915-arnaudb.json [09:49:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T356240 reboot', diff saved to https://phabricator.wikimedia.org/P58211 and previous config saved to /var/cache/conftool/dbconfig/20240229-094945-arnaudb.json [09:49:55] (03CR) 10Btullis: [C: 03+1] "Looks good, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1004080 (owner: 10Majavah) [09:50:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P58212 and previous config saved to /var/cache/conftool/dbconfig/20240229-095049-arnaudb.json [09:51:27] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db1224.eqiad.wmnet with reason: Silence for maintenance T356240 [09:51:30] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db1224.eqiad.wmnet with reason: Silence for maintenance T356240 [09:51:41] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db1224.eqiad.wmnet [09:54:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 100%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P58213 and previous config saved to /var/cache/conftool/dbconfig/20240229-095425-root.json [09:55:27] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1224.eqiad.wmnet [09:55:45] 10SRE-swift-storage, 10ops-codfw, 10ops-eqiad, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9586432 (10ayounsi) One possible path forward is to work with Dell's support to solve {T304483} Another one is to narrow down the chara... 
[09:56:33] PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 101 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [09:57:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 25%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58214 and previous config saved to /var/cache/conftool/dbconfig/20240229-095709-arnaudb.json [09:57:40] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db2117.codfw.wmnet with reason: Silence for maintenance T356240 [09:58:04] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db2117.codfw.wmnet with reason: Silence for maintenance T356240 [09:59:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T356240 ', diff saved to https://phabricator.wikimedia.org/P58215 and previous config saved to /var/cache/conftool/dbconfig/20240229-095918-arnaudb.json [09:59:22] !log joal@deploy2002 Started deploy [analytics/refinery@6e8f25b]: Additional analytics weekly train [analytics/refinery@6e8f25b3] [09:59:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 25%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58216 and previous config saved to /var/cache/conftool/dbconfig/20240229-095923-root.json [10:01:34] RECOVERY - mailman3_queue_size on lists1001 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [10:04:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58217 and previous config saved to /var/cache/conftool/dbconfig/20240229-100421-arnaudb.json [10:05:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to 
https://phabricator.wikimedia.org/P58218 and previous config saved to /var/cache/conftool/dbconfig/20240229-100556-arnaudb.json [10:06:08] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:analytics: gitconfig: rely on http_proxy hiera variable [puppet] - 10https://gerrit.wikimedia.org/r/1004080 (owner: 10Majavah) [10:06:47] (03CR) 10Filippo Giunchedi: "I have reviewed the patch and the situation, I'm not 100% sure about the side effects of running icinga as group adm, though the logs are " [puppet] - 10https://gerrit.wikimedia.org/r/1007470 (https://phabricator.wikimedia.org/T358539) (owner: 10Andrea Denisse) [10:09:02] (03PS1) 10Slyngshede: C:tomcat10 New modules for Tomcat10 [puppet] - 10https://gerrit.wikimedia.org/r/1007569 (https://phabricator.wikimedia.org/T357748) [10:10:44] (03CR) 10Filippo Giunchedi: [C: 03+1] "To provide some context, I believe this change was missed when we effectively moved to a default of "cluster_site" for nagios group. The s" [puppet] - 10https://gerrit.wikimedia.org/r/1007467 (https://phabricator.wikimedia.org/T358540) (owner: 10Andrea Denisse) [10:11:02] !log joal@deploy2002 Finished deploy [analytics/refinery@6e8f25b]: Additional analytics weekly train [analytics/refinery@6e8f25b3] (duration: 11m 39s) [10:11:26] !log joal@deploy2002 Started deploy [analytics/refinery@6e8f25b] (thin): Additional analytics weekly train - THIN [analytics/refinery@6e8f25b3] [10:11:32] !log joal@deploy2002 Finished deploy [analytics/refinery@6e8f25b] (thin): Additional analytics weekly train - THIN [analytics/refinery@6e8f25b3] (duration: 00m 05s) [10:12:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58219 and previous config saved to /var/cache/conftool/dbconfig/20240229-101214-arnaudb.json [10:13:19] !log joal@deploy2002 Started deploy [analytics/refinery@6e8f25b] (hadoop-test): Additional analytics weekly train - TEST [analytics/refinery@6e8f25b3] [10:14:28] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 50%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58220 and previous config saved to /var/cache/conftool/dbconfig/20240229-101427-root.json [10:14:34] (03CR) 10Majavah: Remove cloud_private_v4_set from cloudgw nftables definition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/999004 (https://phabricator.wikimedia.org/T356850) (owner: 10Cathal Mooney) [10:15:11] (03CR) 10Muehlenhoff: C:tomcat10 New modules for Tomcat10 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1007569 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [10:15:50] (03CR) 10Majavah: cloudgw: filtering traffic routing between VMs and cloud vrf (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1007007 (https://phabricator.wikimedia.org/T356986) (owner: 10Cathal Mooney) [10:16:48] (03PS2) 10Slyngshede: C:tomcat10 New modules for Tomcat10 [puppet] - 10https://gerrit.wikimedia.org/r/1007569 (https://phabricator.wikimedia.org/T357748) [10:17:01] !log joal@deploy2002 Finished deploy [analytics/refinery@6e8f25b] (hadoop-test): Additional analytics weekly train - TEST [analytics/refinery@6e8f25b3] (duration: 03m 41s) [10:18:02] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1534/co" [puppet] - 10https://gerrit.wikimedia.org/r/1007569 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [10:18:57] (03CR) 10Muehlenhoff: [C: 03+2] airflow: Add option to pass the firewall settings via firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1007327 (owner: 10Muehlenhoff) [10:19:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58221 and previous config saved to /var/cache/conftool/dbconfig/20240229-101926-arnaudb.json [10:20:08] PROBLEM - 
Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [10:21:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T357189)', diff saved to https://phabricator.wikimedia.org/P58222 and previous config saved to /var/cache/conftool/dbconfig/20240229-102102-arnaudb.json [10:21:05] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1196.eqiad.wmnet with reason: Maintenance [10:21:08] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [10:21:19] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1196.eqiad.wmnet with reason: Maintenance [10:21:21] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:21:38] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:21:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1196 (T357189)', diff saved to https://phabricator.wikimedia.org/P58223 and previous config saved to /var/cache/conftool/dbconfig/20240229-102143-arnaudb.json [10:24:43] !log Cordoning kubernetes2023.codfw.wmnet for vlan change cookbook tests - T350152 [10:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:49] T350152: Automation to change a server's vlan - https://phabricator.wikimedia.org/T350152 [10:24:51] !log brouberol@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-tool1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1002" [10:26:39] !log brouberol@cumin1002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-tool1005.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin1002" [10:26:39] !log brouberol@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:26:39] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-tool1005.eqiad.wmnet [10:26:53] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! Let's try" [puppet] - 10https://gerrit.wikimedia.org/r/1006523 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [10:26:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T357189)', diff saved to https://phabricator.wikimedia.org/P58224 and previous config saved to /var/cache/conftool/dbconfig/20240229-102656-arnaudb.json [10:27:02] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [10:27:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58225 and previous config saved to /var/cache/conftool/dbconfig/20240229-102719-arnaudb.json [10:27:34] 10SRE-tools, 06Data-Persistence, 06Infrastructure-Foundations, 13Patch-For-Review: Automation to change a server's vlan - https://phabricator.wikimedia.org/T350152#9586525 (10Clement_Goubert) @ayounsi I've drained `kubernetes2023.codfw.wmnet` for you to test the cookbook [10:27:52] 10SRE-tools, 06Data-Persistence, 06Infrastructure-Foundations, 06serviceops-radar, 13Patch-For-Review: Automation to change a server's vlan - https://phabricator.wikimedia.org/T350152#9586526 (10Clement_Goubert) [10:29:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 75%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58226 and previous config saved to /var/cache/conftool/dbconfig/20240229-102932-root.json [10:30:36] (03PS3) 10Slyngshede: 
C:tomcat10 New modules for Tomcat10 [puppet] - 10https://gerrit.wikimedia.org/r/1007569 (https://phabricator.wikimedia.org/T357748) [10:31:53] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1535/co" [puppet] - 10https://gerrit.wikimedia.org/r/1007569 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [10:32:26] (03CR) 10MVernon: [C: 03+1] "Thanks for the explanation :)" [puppet] - 10https://gerrit.wikimedia.org/r/1007467 (https://phabricator.wikimedia.org/T358540) (owner: 10Andrea Denisse) [10:32:39] (03PS1) 10Muehlenhoff: airflow/search: Pass firewall range in profile::firewall syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007574 [10:33:01] (03CR) 10Elukey: "Tobias I think you are on the right path, I see that serviceops uses the same metric: https://gerrit.wikimedia.org/r/c/operations/alerts/+" [alerts] - 10https://gerrit.wikimedia.org/r/1007564 (https://phabricator.wikimedia.org/T358742) (owner: 10Klausman) [10:34:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1201 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58227 and previous config saved to /var/cache/conftool/dbconfig/20240229-103431-arnaudb.json [10:34:55] (03CR) 10Filippo Giunchedi: [C: 03+1] "PCC checks out" [puppet] - 10https://gerrit.wikimedia.org/r/1007299 (https://phabricator.wikimedia.org/T358647) (owner: 10Fabfur) [10:35:04] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1007569 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [10:35:06] (03CR) 10Klausman: "Looking at that patch, it looks we already are using those: the deploy: tag at the top of the file would indicate thus. 
I'll poke the k8s " [alerts] - 10https://gerrit.wikimedia.org/r/1007564 (https://phabricator.wikimedia.org/T358742) (owner: 10Klausman) [10:35:08] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [10:35:20] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007574 (owner: 10Muehlenhoff) [10:40:07] (03PS4) 10Slyngshede: C:tomcat10 New modules for Tomcat10 [puppet] - 10https://gerrit.wikimedia.org/r/1007569 (https://phabricator.wikimedia.org/T357748) [10:41:17] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1537/console" [puppet] - 10https://gerrit.wikimedia.org/r/1007569 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [10:42:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P58228 and previous config saved to /var/cache/conftool/dbconfig/20240229-104202-arnaudb.json [10:42:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58229 and previous config saved to /var/cache/conftool/dbconfig/20240229-104223-arnaudb.json [10:42:51] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1538/console" [puppet] - 10https://gerrit.wikimedia.org/r/1007569 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [10:44:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2034 (re)pooling @ 100%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58230 and previous config saved to /var/cache/conftool/dbconfig/20240229-104437-root.json [10:53:19] (03CR) 10EoghanGaffney: 
[C: 03+2] jenkins: add security patch bot token to releases instance secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1007319 (https://phabricator.wikimedia.org/T350065) (owner: 10Jaime Nuche) [10:53:23] (03CR) 10EoghanGaffney: [V: 03+2 C: 03+2] jenkins: add security patch bot token to releases instance secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1007319 (https://phabricator.wikimedia.org/T350065) (owner: 10Jaime Nuche) [10:53:55] (03CR) 10EoghanGaffney: [C: 03+1] jenkins: add security patch bot token to releases instance [puppet] - 10https://gerrit.wikimedia.org/r/1007323 (https://phabricator.wikimedia.org/T350065) (owner: 10Jaime Nuche) [10:54:02] (03CR) 10EoghanGaffney: [C: 03+2] jenkins: add security patch bot token to releases instance [puppet] - 10https://gerrit.wikimedia.org/r/1007323 (https://phabricator.wikimedia.org/T350065) (owner: 10Jaime Nuche) [10:57:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P58231 and previous config saved to /var/cache/conftool/dbconfig/20240229-105708-arnaudb.json [10:57:36] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1540/console" [puppet] - 10https://gerrit.wikimedia.org/r/1007299 (https://phabricator.wikimedia.org/T358647) (owner: 10Fabfur) [10:59:47] 06SRE, 06Content-Transform-Team, 10MW-on-K8s, 06Traffic, and 2 others: Reimage parse* hosts as kubernetes nodes - https://phabricator.wikimedia.org/T358752 (10akosiaris) [11:00:04] mvolz: OwO what's this, a deployment window?? Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240229T1100). nyaa~ [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240229T1100) [11:00:57] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1007566 (owner: 10Majavah) [11:03:15] (03CR) 10Effie Mouzeli: [C: 03+2] mw-mcrouter: add helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/979363 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:04:14] (03Merged) 10jenkins-bot: mw-mcrouter: add helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/979363 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:04:20] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:04:42] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 7/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:04:58] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:05:20] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:05:42] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:05:58] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:06:25] (03PS1) 10Btullis: Allow the systemd-timer-mail-wrapper to override the sender [puppet] - 10https://gerrit.wikimedia.org/r/1007576 (https://phabricator.wikimedia.org/T358675) [11:06:27] (03PS1) 10Btullis: Allow systemd::timer::job to send from a custom address [puppet] - 10https://gerrit.wikimedia.org/r/1007577 (https://phabricator.wikimedia.org/T358675) [11:06:29] (03PS1) 10Btullis: Allow kerberos::systemd::timer to use a custom email sender [puppet] - 
10https://gerrit.wikimedia.org/r/1007578 (https://phabricator.wikimedia.org/T358675) [11:07:04] (03CR) 10Arturo Borrero Gonzalez: P:openstack: rabbitmq: use firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/998419 (owner: 10Majavah) [11:08:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1007293 (owner: 10Majavah) [11:09:10] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:wmcs::cloudgw: make virt_floating an array [puppet] - 10https://gerrit.wikimedia.org/r/1007566 (owner: 10Majavah) [11:10:27] (03CR) 10Arturo Borrero Gonzalez: "LGTM, except a minor comment." [puppet] - 10https://gerrit.wikimedia.org/r/1007294 (owner: 10Majavah) [11:11:26] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2156.codfw.wmnet onto db2190.codfw.wmnet [11:12:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T357189)', diff saved to https://phabricator.wikimedia.org/P58232 and previous config saved to /var/cache/conftool/dbconfig/20240229-111215-arnaudb.json [11:12:17] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1206.eqiad.wmnet with reason: Maintenance [11:12:21] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [11:12:41] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1206.eqiad.wmnet with reason: Maintenance [11:12:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1206 (T357189)', diff saved to https://phabricator.wikimedia.org/P58233 and previous config saved to /var/cache/conftool/dbconfig/20240229-111247-arnaudb.json [11:14:27] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] P:openstack: rabbitmq: remove cinder-backups term (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1007295 (owner: 10Majavah) [11:14:38] (03CR) 10Slyngshede: [V: 03+1] C:tomcat10 New modules for Tomcat10 (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1007569 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [11:15:04] (03CR) 10Slyngshede: "This isn't actually the full blown exporter, that would be https://github.com/ncabatoff/process-exporter, this one just piggy backs on the" [puppet] - 10https://gerrit.wikimedia.org/r/1004061 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [11:15:24] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:tomcat10 New modules for Tomcat10 [puppet] - 10https://gerrit.wikimedia.org/r/1007569 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [11:17:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T357189)', diff saved to https://phabricator.wikimedia.org/P58234 and previous config saved to /var/cache/conftool/dbconfig/20240229-111753-arnaudb.json [11:18:00] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [11:18:32] (03PS1) 10Brouberol: superset: allow usage of DATA_CACHE_STORAGE [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007580 (https://phabricator.wikimedia.org/T358753) [11:20:03] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007562 (https://phabricator.wikimedia.org/T354758) (owner: 10Kosta Harlan) [11:21:00] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007562 (https://phabricator.wikimedia.org/T354758) (owner: 10Kosta Harlan) [11:22:48] (03PS2) 10Brouberol: superset: allow usage of DATA_CACHE_STORAGE [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007580 (https://phabricator.wikimedia.org/T358753) [11:22:50] (03PS1) 10Brouberol: superset-next: enable data cache [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007581 (https://phabricator.wikimedia.org/T358753) [11:22:52] (03PS1) 10Brouberol: superset-next: enable per-user query cache [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1007582 (https://phabricator.wikimedia.org/T358753) [11:25:37] (03PS3) 10Brouberol: superset: allow usage of DATA_CACHE_CONFIG [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007580 (https://phabricator.wikimedia.org/T358753) [11:25:39] (03PS2) 10Brouberol: superset-next: enable data cache [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007581 (https://phabricator.wikimedia.org/T358753) [11:25:43] (03PS2) 10Brouberol: superset-next: enable per-user query cache [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007582 (https://phabricator.wikimedia.org/T358753) [11:26:26] (03CR) 10Muehlenhoff: "This looks good, but we can probably just make this the default behaviour and send all systemd timer mails from something like noreply@wik" [puppet] - 10https://gerrit.wikimedia.org/r/1007576 (https://phabricator.wikimedia.org/T358675) (owner: 10Btullis) [11:26:44] (03CR) 10Joal: [C: 03+1] superset: allow usage of DATA_CACHE_CONFIG [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007580 (https://phabricator.wikimedia.org/T358753) (owner: 10Brouberol) [11:27:12] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/mw-mcrouter: apply [11:27:13] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-mcrouter: apply [11:28:28] (03PS3) 10Brouberol: superset-next: enable data cache [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007581 (https://phabricator.wikimedia.org/T358753) [11:28:30] (03PS3) 10Brouberol: superset-next: enable per-user query cache [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007582 (https://phabricator.wikimedia.org/T358753) [11:28:40] (CirrusSearchNodeIndexingNotIncreasing) firing: Elasticsearch instance elastic2043-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [11:30:51] (CR) Joal: [C: +1] superset-next: enable data cache [deployment-charts] - https://gerrit.wikimedia.org/r/1007581 (https://phabricator.wikimedia.org/T358753) (owner: Brouberol) [11:30:53] (CR) Btullis: "Personally, I'd be fine with that approach too, but I was worried about changing the default behaviour and breaking other people's stuff." [puppet] - https://gerrit.wikimedia.org/r/1007576 (https://phabricator.wikimedia.org/T358675) (owner: Btullis) [11:31:31] (CR) Joal: [C: +1] superset-next: enable per-user query cache [deployment-charts] - https://gerrit.wikimedia.org/r/1007582 (https://phabricator.wikimedia.org/T358753) (owner: Brouberol) [11:31:35] (CR) Brouberol: [C: +2] superset: allow usage of DATA_CACHE_CONFIG [deployment-charts] - https://gerrit.wikimedia.org/r/1007580 (https://phabricator.wikimedia.org/T358753) (owner: Brouberol) [11:31:47] (CR) Brouberol: [C: +2] superset-next: enable data cache [deployment-charts] - https://gerrit.wikimedia.org/r/1007581 (https://phabricator.wikimedia.org/T358753) (owner: Brouberol) [11:33:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P58235 and previous config saved to /var/cache/conftool/dbconfig/20240229-113259-arnaudb.json [11:34:52] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [11:35:21] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [11:36:18] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/mw-mcrouter: apply [11:36:19] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-mcrouter: apply [11:37:51] (PS1) Clément Goubert: mw-api-int: Increase replicas to 240 total [deployment-charts] - https://gerrit.wikimedia.org/r/1007584 
(https://phabricator.wikimedia.org/T356497) [11:39:30] (PS1) Muehlenhoff: Sync context.xml to default template from 10/Bookworm [puppet] - https://gerrit.wikimedia.org/r/1007585 (https://phabricator.wikimedia.org/T357748) [11:44:21] (CR) Brouberol: [C: +2] superset-next: enable per-user query cache [deployment-charts] - https://gerrit.wikimedia.org/r/1007582 (https://phabricator.wikimedia.org/T358753) (owner: Brouberol) [11:44:30] (CR) CI reject: [V: -1] superset-next: enable per-user query cache [deployment-charts] - https://gerrit.wikimedia.org/r/1007582 (https://phabricator.wikimedia.org/T358753) (owner: Brouberol) [11:44:42] (PS4) Brouberol: superset-next: enable per-user query cache [deployment-charts] - https://gerrit.wikimedia.org/r/1007582 (https://phabricator.wikimedia.org/T358753) [11:45:58] (CR) Brouberol: [V: +2 C: +2] superset-next: enable per-user query cache [deployment-charts] - https://gerrit.wikimedia.org/r/1007582 (https://phabricator.wikimedia.org/T358753) (owner: Brouberol) [11:47:18] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [11:47:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [11:48:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P58236 and previous config saved to /var/cache/conftool/dbconfig/20240229-114806-arnaudb.json [11:50:01] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [11:50:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [11:55:07] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/mw-mcrouter: apply [11:55:08] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-mcrouter: apply [11:55:19] !log jiji@deploy2002 helmfile [eqiad] 
START helmfile.d/services/mw-mcrouter: apply [11:55:20] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [12:00:03] !log kharlan@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [12:00:33] !log kharlan@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [12:01:50] (PS2) Muehlenhoff: Sync context.xml to default template from Tomcat 10/Bookworm [puppet] - https://gerrit.wikimedia.org/r/1007585 (https://phabricator.wikimedia.org/T357748) [12:01:54] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [12:02:24] SRE, Data Pipelines, Data-Engineering, Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227#9586760 (dr0ptp4kt) Hi team - @lbowmaker asked if I could take a look at this and provide some context. I was having a think on this, and I'd like to pond... [12:02:34] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [12:03:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T357189)', diff saved to https://phabricator.wikimedia.org/P58238 and previous config saved to /var/cache/conftool/dbconfig/20240229-120312-arnaudb.json [12:03:14] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1207.eqiad.wmnet with reason: Maintenance [12:03:22] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [12:03:29] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1207.eqiad.wmnet with reason: Maintenance [12:03:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1207 (T357189)', diff saved to https://phabricator.wikimedia.org/P58239 and previous config saved to /var/cache/conftool/dbconfig/20240229-120335-arnaudb.json [12:03:55] !log kharlan@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply [12:04:17] !log 
kharlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [12:04:41] (CR) Slyngshede: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1007585 (https://phabricator.wikimedia.org/T357748) (owner: Muehlenhoff) [12:09:16] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:09:33] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:09:34] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:10:18] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:10:18] SRE, ops-codfw, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack B6 from asw-b6-codfw to lsw1-b6-codfw - https://phabricator.wikimedia.org/T355871#9586779 (cmooney) Open→Resolved a:cmooney Closing task - thanks all for the co-operation! 
[12:10:26] SRE, ops-codfw, Infrastructure-Foundations, netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544#9586782 (cmooney) [12:10:33] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:10:36] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:12:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T352010)', diff saved to https://phabricator.wikimedia.org/P58240 and previous config saved to /var/cache/conftool/dbconfig/20240229-121202-ladsgroup.json [12:12:09] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [12:14:29] !log cmooney@cumin1002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet [12:14:39] SRE, SRE-swift-storage, ops-codfw, Infrastructure-Foundations, netops: Migrate servers in codfw rack B7 from asw-b7-codfw to lsw1-b7-codfw - https://phabricator.wikimedia.org/T355872#9586789 (ops-monitoring-bot) Draining ganeti2032.codfw.wmnet of running VMs [12:15:16] (CR) Muehlenhoff: [C: +2] Sync context.xml to default template from Tomcat 10/Bookworm [puppet] - https://gerrit.wikimedia.org/r/1007585 (https://phabricator.wikimedia.org/T357748) (owner: Muehlenhoff) [12:16:43] !log cmooney@cumin1002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet [12:20:33] (PS1) Slyngshede: C:tomcat10 Sync server.xml with default from Tomcat10/Bookworm. 
[puppet] - https://gerrit.wikimedia.org/r/1007596 (https://phabricator.wikimedia.org/T357748) [12:27:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P58242 and previous config saved to /var/cache/conftool/dbconfig/20240229-122709-ladsgroup.json [12:38:59] (PS1) Muehlenhoff: Sync web.xml to default template from Tomcat 10/Bookworm [puppet] - https://gerrit.wikimedia.org/r/1007603 (https://phabricator.wikimedia.org/T357748) [12:42:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P58244 and previous config saved to /var/cache/conftool/dbconfig/20240229-124215-ladsgroup.json [12:46:25] (PS2) Slyngshede: C:tomcat10 Sync server.xml with default from Tomcat10/Bookworm. [puppet] - https://gerrit.wikimedia.org/r/1007596 (https://phabricator.wikimedia.org/T357748) [12:47:26] (CR) Slyngshede: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1543/console" [puppet] - https://gerrit.wikimedia.org/r/1007596 (https://phabricator.wikimedia.org/T357748) (owner: Slyngshede) [12:48:30] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:53:59] (CR) Jelto: [C: +2] prometheus::ops: monitor active etherpad instance only [puppet] - https://gerrit.wikimedia.org/r/1006523 (https://phabricator.wikimedia.org/T316421) (owner: Jelto) [12:54:13] (CR) Jelto: [C: +2] "Done" [puppet] - https://gerrit.wikimedia.org/r/1006523 (https://phabricator.wikimedia.org/T316421) (owner: Jelto) [12:57:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 
(T352010)', diff saved to https://phabricator.wikimedia.org/P58245 and previous config saved to /var/cache/conftool/dbconfig/20240229-125723-ladsgroup.json [12:57:25] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [12:57:30] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [12:57:32] SRE, ops-codfw, Infrastructure-Foundations, netops: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9586848 (cmooney) After a quick discussion on irc I think we can't wipe the config for every unit in the VC over ssh to the master. So probably easiest to do that via se... [12:57:39] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240229T1300) [13:04:44] (PS3) Elukey: kserve: upgrade all images to Bookworm [docker-images/production-images] - https://gerrit.wikimedia.org/r/1007400 (https://phabricator.wikimedia.org/T337213) [13:06:23] (CR) Elukey: "Tobias: I have reworked completely the storage-initializer to work with Bookworm and python 3.11, lemme know if you like the way to go or " [docker-images/production-images] - https://gerrit.wikimedia.org/r/1007400 (https://phabricator.wikimedia.org/T337213) (owner: Elukey) [13:11:20] (CR) Muehlenhoff: "Looks good, two nits/comments inline" [puppet] - https://gerrit.wikimedia.org/r/1007596 (https://phabricator.wikimedia.org/T357748) (owner: Slyngshede) [13:38:52] (CR) Jelto: [C: +2] "After merging this all etherpad metrics are missing and no instance is being scraped anymore." 
[puppet] - https://gerrit.wikimedia.org/r/1006523 (https://phabricator.wikimedia.org/T316421) (owner: Jelto) [13:40:44] (PS1) Majavah: P:prometheus: fix etherpad scraping [puppet] - https://gerrit.wikimedia.org/r/1007610 [13:41:15] (CR) Majavah: "This is a follow-up to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1006523." [puppet] - https://gerrit.wikimedia.org/r/1007610 (owner: Majavah) [13:41:19] SRE, ops-codfw: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9586936 (ayounsi) As far as I know the MB serial number is the most convenient unique identifier we can use as it's on the chassis for most of the devices and we can query it in a programmatic way... [13:43:05] (CR) Majavah: "Ah, I missed that currently `class_name` is the role but the parameter is on the profile. Sent https://gerrit.wikimedia.org/r/c/operations" [puppet] - https://gerrit.wikimedia.org/r/1006523 (https://phabricator.wikimedia.org/T316421) (owner: Jelto) [13:43:43] (PS1) Muehlenhoff: Remove access for jfishback [puppet] - https://gerrit.wikimedia.org/r/1007611 [13:43:49] (PS2) Anzx: cswiki, commonswiki, enwiki: Lift IP cap for Women in Science Editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/1007375 (https://phabricator.wikimedia.org/T358755) [13:45:00] (CR) CI reject: [V: -1] Remove access for jfishback [puppet] - https://gerrit.wikimedia.org/r/1007611 (owner: Muehlenhoff) [13:45:10] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1007610 (owner: Majavah) [13:45:29] (PS3) Anzx: cswiki, commonswiki, enwiki: Lift IP cap for Women in Science Editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/1007375 (https://phabricator.wikimedia.org/T358755) [13:46:45] (CR) Jelto: [C: +1] "lgtm, thanks 
for the quick fix!" [puppet] - https://gerrit.wikimedia.org/r/1007610 (owner: Majavah) [13:47:10] (CR) Majavah: [V: +1 C: +2] P:prometheus: fix etherpad scraping [puppet] - https://gerrit.wikimedia.org/r/1007610 (owner: Majavah) [13:47:33] (CR) Jelto: [V: +1 C: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1545/co" [puppet] - https://gerrit.wikimedia.org/r/1007610 (owner: Majavah) [13:48:43] (PS1) Ayounsi: [WIP] netbox: add functions to get and set device name [software/spicerack] - https://gerrit.wikimedia.org/r/1007614 [13:49:11] (CR) Slyngshede: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1546/console" [puppet] - https://gerrit.wikimedia.org/r/1007596 (https://phabricator.wikimedia.org/T357748) (owner: Slyngshede) [13:49:55] (PS2) Muehlenhoff: Remove access for jfishback [puppet] - https://gerrit.wikimedia.org/r/1007611 [13:52:42] (PS1) Lucas Werkmeister (WMDE): Bump special-new-lexeme, fix redirect without temp user [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.20) - https://gerrit.wikimedia.org/r/1007376 (https://phabricator.wikimedia.org/T358754) [13:55:03] (CR) Muehlenhoff: [C: +2] Remove access for jfishback [puppet] - https://gerrit.wikimedia.org/r/1007611 (owner: Muehlenhoff) [13:55:43] (PS1) Effie Mouzeli: mw-mcrouter: fix helmfile typo [deployment-charts] - https://gerrit.wikimedia.org/r/1007617 [13:58:17] (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1006882 [13:58:53] (CR) Effie Mouzeli: [C: +2] mw-mcrouter: fix helmfile typo [deployment-charts] - https://gerrit.wikimedia.org/r/1007617 (owner: Effie Mouzeli) [13:59:30] (PS1) Brouberol: superset: include all configuration checksums into the Deployment annotations [deployment-charts] 
- https://gerrit.wikimedia.org/r/1007618 (https://phabricator.wikimedia.org/T352166) [13:59:33] (PS3) Slyngshede: C:tomcat10 Sync server.xml with default from Tomcat10/Bookworm. [puppet] - https://gerrit.wikimedia.org/r/1007596 (https://phabricator.wikimedia.org/T357748) [13:59:44] (CR) Slyngshede: C:tomcat10 Sync server.xml with default from Tomcat10/Bookworm. (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1007596 (https://phabricator.wikimedia.org/T357748) (owner: Slyngshede) [13:59:58] (Merged) jenkins-bot: mw-mcrouter: fix helmfile typo [deployment-charts] - https://gerrit.wikimedia.org/r/1007617 (owner: Effie Mouzeli) [14:00:05] (CR) Jelto: [V: +1 C: +1] "metrics for the production instance etherpad1004 are back and I don't see alerts for the etherpad replicas/standby host. So that works now" [puppet] - https://gerrit.wikimedia.org/r/1007610 (owner: Majavah) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240229T1400). [14:00:05] anzx and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:26] ops-eqiad, DC-Ops: hw move: GPU from stat1005 to stat1010 - https://phabricator.wikimedia.org/T358763 (BTullis) [14:00:28] * urbanecm waves [14:00:37] but since Lucas_WMDE has a patch, i presume they'd do the honors? [14:00:38] I’m around but might have to wait 20-30 minutes first before I’m free [14:00:55] ops-eqiad, DC-Ops: hw move: GPU from stat1005 to stat1010 - https://phabricator.wikimedia.org/T358763#9587016 (BTullis) [14:00:58] (it’s also possible that I’ll be free in a few minutes. 
depends if anyone else shows up in this meeting ^^) [14:01:06] o/ [14:01:08] hehe [14:01:11] i'll start then? [14:01:21] (CR) Urbanecm: [C: +2] cswiki, commonswiki, enwiki: Lift IP cap for Women in Science Editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/1007375 (https://phabricator.wikimedia.org/T358755) (owner: Anzx) [14:01:24] please do [14:01:45] Lucas_WMDE: do you want me to +2 your backport, to save on CI time? or rather not, as it's not clear when you'd be available? [14:01:49] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/mw-mcrouter: apply [14:01:50] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-mcrouter: apply [14:01:54] I think it’s fine to +2 it already [14:02:00] (CR) Urbanecm: [C: +2] Bump special-new-lexeme, fix redirect without temp user [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.20) - https://gerrit.wikimedia.org/r/1007376 (https://phabricator.wikimedia.org/T358754) (owner: Lucas Werkmeister (WMDE)) [14:02:05] thanks! 
[14:02:05] (PS2) Brouberol: superset: include all configuration checksums into the Deployment annotations [deployment-charts] - https://gerrit.wikimedia.org/r/1007618 (https://phabricator.wikimedia.org/T352166) [14:02:08] (Merged) jenkins-bot: cswiki, commonswiki, enwiki: Lift IP cap for Women in Science Editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/1007375 (https://phabricator.wikimedia.org/T358755) (owner: Anzx) [14:02:13] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/mw-mcrouter: apply [14:02:14] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-mcrouter: apply [14:02:40] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Jfishback out of all services on: 8 hosts [14:02:50] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jfishback out of all services on: 8 hosts [14:02:51] (CR) Brouberol: [C: +1] Allow kerberos::systemd::timer to use a custom email sender [puppet] - https://gerrit.wikimedia.org/r/1007578 (https://phabricator.wikimedia.org/T358675) (owner: Btullis) [14:02:58] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:1007375|cswiki, commonswiki, enwiki: Lift IP cap for Women in Science Editathon (T358755)]] [14:03:00] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [14:03:05] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [14:03:05] T358755: Lift IP cap for editathon Women in Science - https://phabricator.wikimedia.org/T358755 [14:03:50] (CR) Brouberol: [C: +1] Allow systemd::timer::job to send from a custom address [puppet] - https://gerrit.wikimedia.org/r/1007577 (https://phabricator.wikimedia.org/T358675) (owner: Btullis) [14:04:02] * Lucas_WMDE is free now, it turns out :) [14:04:30] !log urbanecm@deploy2002 anzx and urbanecm: Backport for [[gerrit:1007375|cswiki, commonswiki, enwiki: Lift IP cap for Women in Science Editathon (T358755)]] synced 
to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:04:41] !log urbanecm@deploy2002 anzx and urbanecm: Continuing with sync [14:04:43] urbanecm: nothing to test, please sync [14:04:47] doing doing :) [14:04:51] Lucas_WMDE: just finishing the throttle rule [14:04:58] feel free to take over when scap finishes [14:05:03] ok! [14:05:41] (CR) Slyngshede: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1007603 (https://phabricator.wikimedia.org/T357748) (owner: Muehlenhoff) [14:11:08] (CR) CI reject: [V: -1] Bump special-new-lexeme, fix redirect without temp user [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.20) - https://gerrit.wikimedia.org/r/1007376 (https://phabricator.wikimedia.org/T358754) (owner: Lucas Werkmeister (WMDE)) [14:11:16] * Lucas_WMDE looks -.- [14:12:34] (PS1) Lucas Werkmeister (WMDE): Remove unused Phan suppression [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.20) - https://gerrit.wikimedia.org/r/1007377 [14:12:41] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:1007375|cswiki, commonswiki, enwiki: Lift IP cap for Women in Science Editathon (T358755)]] (duration: 09m 42s) [14:12:42] (PS2) Lucas Werkmeister (WMDE): Bump special-new-lexeme, fix redirect without temp user [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.20) - https://gerrit.wikimedia.org/r/1007376 (https://phabricator.wikimedia.org/T358754) [14:12:47] T358755: Lift IP cap for editathon Women in Science - https://phabricator.wikimedia.org/T358755 [14:13:35] urbanecm: I’ll go ahead with my changes then [14:13:40] go ahead [14:14:03] !log lucaswerkmeister-wmde@deploy2002 backport Canceled [14:14:08] urbanecm: thank you [14:14:18] anzx: no problem :) [14:14:35] (PS2) Lucas Werkmeister (WMDE): Remove unused Phan suppression [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.20) - https://gerrit.wikimedia.org/r/1007377 [14:14:54] (CR) Lucas Werkmeister (WMDE): "Removed the Depends-On, the referenced change was merged on Wikibase 
before the branch cut and apparently `scap backport` had trouble reco" [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.20) - https://gerrit.wikimedia.org/r/1007377 (owner: Lucas Werkmeister (WMDE)) [14:15:00] (PS3) Lucas Werkmeister (WMDE): Bump special-new-lexeme, fix redirect without temp user [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.20) - https://gerrit.wikimedia.org/r/1007376 (https://phabricator.wikimedia.org/T358754) [14:15:09] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.20) - https://gerrit.wikimedia.org/r/1007377 (owner: Lucas Werkmeister (WMDE)) [14:15:13] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.20) - https://gerrit.wikimedia.org/r/1007376 (https://phabricator.wikimedia.org/T358754) (owner: Lucas Werkmeister (WMDE)) [14:22:42] (CR) Muehlenhoff: "I don't anyone is really settled on the current format. Back then when I added the wrapper to allow porting cron jobs (since cron has that" [puppet] - https://gerrit.wikimedia.org/r/1007576 (https://phabricator.wikimedia.org/T358675) (owner: Btullis) [14:25:27] (PS1) Elukey: kserve: update default Docker images [deployment-charts] - https://gerrit.wikimedia.org/r/1007624 (https://phabricator.wikimedia.org/T337213) [14:26:06] (CR) Slyngshede: [V: +1 C: +2] P:idp switch final prodution host to Java 17. 
[puppet] - https://gerrit.wikimedia.org/r/1007487 (https://phabricator.wikimedia.org/T357748) (owner: Slyngshede) [14:30:39] (PS1) Effie Mouzeli: mw-mcrouter: add values-staging.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/1007625 [14:32:23] SRE, ops-codfw, DC-Ops, Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9587186 (Jhancock.wm) [14:32:44] SRE, ops-codfw, DC-Ops, Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343#9587187 (Jhancock.wm) [14:33:36] SRE, ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9587193 (phaultfinder) [14:33:41] (Merged) jenkins-bot: Remove unused Phan suppression [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.20) - https://gerrit.wikimedia.org/r/1007377 (owner: Lucas Werkmeister (WMDE)) [14:34:03] (Merged) jenkins-bot: Bump special-new-lexeme, fix redirect without temp user [extensions/WikibaseLexeme] (wmf/1.42.0-wmf.20) - https://gerrit.wikimedia.org/r/1007376 (https://phabricator.wikimedia.org/T358754) (owner: Lucas Werkmeister (WMDE)) [14:34:06] yay [14:34:32] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:1007377|Remove unused Phan suppression]], [[gerrit:1007376|Bump special-new-lexeme, fix redirect without temp user (T358754)]] [14:34:38] T358754: Special:NewLexeme no longer redirects to created lexeme (unless a temporary account was created) - https://phabricator.wikimedia.org/T358754 [14:34:58] (PS2) Effie Mouzeli: mw-mcrouter: add values-staging.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/1007625 [14:36:05] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1007377|Remove unused Phan suppression]], [[gerrit:1007376|Bump special-new-lexeme, fix redirect without temp user (T358754)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:36:14] 
testing… [14:36:29] (CR) Effie Mouzeli: [C: +2] mw-mcrouter: add values-staging.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/1007625 (owner: Effie Mouzeli) [14:36:34] seems to work [14:36:35] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync [14:37:24] (Merged) jenkins-bot: mw-mcrouter: add values-staging.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/1007625 (owner: Effie Mouzeli) [14:38:02] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:24] (PS12) Ssingh: P:dns::auth: add support for depooling services via confd/confctl [puppet] - https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) [14:40:46] (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) (owner: Ssingh) [14:41:19] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/mw-mcrouter: apply [14:41:20] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-mcrouter: apply [14:41:46] (CR) Klausman: [C: +1] "- I think decoupling Py version in the storage init from what we run elsewhere in the same pod is fine." 
[docker-images/production-images] - https://gerrit.wikimedia.org/r/1007400 (https://phabricator.wikimedia.org/T337213) (owner: Elukey) [14:44:41] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:1007377|Remove unused Phan suppression]], [[gerrit:1007376|Bump special-new-lexeme, fix redirect without temp user (T358754)]] (duration: 10m 08s) [14:44:47] T358754: Special:NewLexeme no longer redirects to created lexeme (unless a temporary account was created) - https://phabricator.wikimedia.org/T358754 [14:44:52] !log UTC afternoon backport+config window done [14:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:05] SRE, ops-codfw, Infrastructure-Foundations, netops: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9587275 (Jhancock.wm) @cmooney they're on the old asw switches. Let me know when you want to move them back to the new lsw. [14:51:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T354015)', diff saved to https://phabricator.wikimedia.org/P58246 and previous config saved to /var/cache/conftool/dbconfig/20240229-145125-marostegui.json [14:51:32] T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015 [14:58:02] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:00:38] (CR) Marostegui: [C: +2] Revert "db2156: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1007277 (owner: Marostegui) [15:01:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2156 (re)pooling @ 5%: After recloning', diff saved to https://phabricator.wikimedia.org/P58247 and previous config saved to 
/var/cache/conftool/dbconfig/20240229-150105-root.json [15:02:45] !log T357007 Running mwscript CampaignEvents:GenerateInvitationList --wiki=metawiki --listfile=/home/daimona/list.txt [15:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:51] T357007: Generate Invitation Lists for Event Organizers - https://phabricator.wikimedia.org/T357007 [15:05:27] (PS1) Effie Mouzeli: mw-mcrouter: helmfile copypasta tidy up [deployment-charts] - https://gerrit.wikimedia.org/r/1007632 [15:06:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P58248 and previous config saved to /var/cache/conftool/dbconfig/20240229-150632-marostegui.json [15:06:37] (CR) Btullis: "OK, thanks for the additional context. How about if we update the default to be noreply@wikimedia.org and still keep the ability to overri" [puppet] - https://gerrit.wikimedia.org/r/1007576 (https://phabricator.wikimedia.org/T358675) (owner: Btullis) [15:08:31] (JobUnavailable) firing: (2) Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:15:19] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db1213.eqiad.wmnet with reason: Maint test [15:15:24] (CR) Muehlenhoff: "https://phabricator.wikimedia.org/P58249 is a diff between the Debian default and our latest template. 
With that, we have nicely represent" [puppet] - 10https://gerrit.wikimedia.org/r/1007596 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [15:15:43] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1213.eqiad.wmnet with reason: Maint test [15:16:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2156 (re)pooling @ 10%: After recloning', diff saved to https://phabricator.wikimedia.org/P58250 and previous config saved to /var/cache/conftool/dbconfig/20240229-151610-root.json [15:17:17] (03CR) 10Kamila Součková: [C: 03+1] mw-api-int: Increase replicas to 240 total [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007584 (https://phabricator.wikimedia.org/T356497) (owner: 10Clément Goubert) [15:19:38] (03PS2) 10Effie Mouzeli: mw-mcrouter: helmfile copypasta tidy up [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007632 [15:21:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P58251 and previous config saved to /var/cache/conftool/dbconfig/20240229-152139-marostegui.json [15:23:22] (03CR) 10Muehlenhoff: "Sure, sounds good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/1007576 (https://phabricator.wikimedia.org/T358675) (owner: 10Btullis) [15:28:40] (CirrusSearchNodeIndexingNotIncreasing) firing: Elasticsearch instance elastic2043-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [15:29:24] PROBLEM - Thanos swift https on thanos-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos [15:29:30] (03CR) 10Volans: "Approach LGTM, nit inline and missing tests :)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 (owner: 10Ayounsi) [15:30:16] RECOVERY - Thanos swift https on thanos-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Thanos [15:31:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2156 (re)pooling @ 25%: After recloning', diff saved to https://phabricator.wikimedia.org/P58252 and previous config saved to /var/cache/conftool/dbconfig/20240229-153115-root.json [15:33:31] (JobUnavailable) resolved: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T354015)', diff saved to https://phabricator.wikimedia.org/P58253 and previous config saved to /var/cache/conftool/dbconfig/20240229-153646-marostegui.json [15:36:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1169.eqiad.wmnet with reason: Maintenance [15:36:52] !log marostegui@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 12:00:00 on db1169.eqiad.wmnet with reason: Maintenance [15:36:53] T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015 [15:36:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1169 (T354015)', diff saved to https://phabricator.wikimedia.org/P58254 and previous config saved to /var/cache/conftool/dbconfig/20240229-153658-marostegui.json [15:39:58] (03PS1) 10FNegri: [wmcs-backup] Don't backup temp toolsdb volumes [puppet] - 10https://gerrit.wikimedia.org/r/1007636 (https://phabricator.wikimedia.org/T356904) [15:40:06] !log swfrench@cumin2002 dbctl commit (dc=all): 'Depooling db1213 for exercise', diff saved to https://phabricator.wikimedia.org/P58255 and previous config saved to /var/cache/conftool/dbconfig/20240229-154005-swfrench.json [15:43:32] !log installing tar security updates [15:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:01] (03CR) 10Effie Mouzeli: [C: 03+2] mw-mcrouter: helmfile copypasta tidy up [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007632 (owner: 10Effie Mouzeli) [15:45:53] (03Merged) 10jenkins-bot: mw-mcrouter: helmfile copypasta tidy up [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007632 (owner: 10Effie Mouzeli) [15:46:16] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/mw-mcrouter: apply [15:46:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2156 (re)pooling @ 50%: After recloning', diff saved to https://phabricator.wikimedia.org/P58256 and previous config saved to /var/cache/conftool/dbconfig/20240229-154619-root.json [15:48:06] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-mcrouter: apply [15:48:12] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [15:49:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 10%: Excersie over', diff saved to 
https://phabricator.wikimedia.org/P58257 and previous config saved to /var/cache/conftool/dbconfig/20240229-154944-root.json [15:50:59] swfrench-wmf: See ^ [15:51:44] (03CR) 10Ahmon Dancy: [C: 03+1] "Thanks Daniel!" [puppet] - 10https://gerrit.wikimedia.org/r/1007449 (https://phabricator.wikimedia.org/T357402) (owner: 10Ahmon Dancy) [15:52:02] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/mw-mcrouter: apply [15:52:29] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-mcrouter: apply [15:56:17] Amir1: thanks! [15:56:28] (03PS4) 10Elukey: kserve: upgrade all images to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1007400 (https://phabricator.wikimedia.org/T337213) [15:57:05] (03CR) 10Elukey: "After a chat with Tobias I found a way to avoid pipx and use venvs directly, without changing the nobody user." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1007400 (https://phabricator.wikimedia.org/T337213) (owner: 10Elukey) [15:57:33] (03PS5) 10Elukey: kserve: upgrade all images to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1007400 (https://phabricator.wikimedia.org/T337213) [15:59:45] !log configuring lsw1-b7-codfw in advance of server migration T355872 [15:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:51] T355872: Migrate servers in codfw rack B7 from asw-b7-codfw to lsw1-b7-codfw - https://phabricator.wikimedia.org/T355872 [16:01:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2156 (re)pooling @ 75%: After recloning', diff saved to https://phabricator.wikimedia.org/P58258 and previous config saved to /var/cache/conftool/dbconfig/20240229-160124-root.json [16:01:26] (03PS1) 10Effie Mouzeli: mw-mcrouter: add service stanza [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007641 [16:01:27] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 
asw-b-codfw,cr[1-2]-codfw,lsw1-b7-codfw with reason: prepping for server uplink migration codfw rack b7 [16:01:43] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw-b-codfw,cr[1-2]-codfw,lsw1-b7-codfw with reason: prepping for server uplink migration codfw rack b7 [16:01:48] 06SRE, 10SRE-swift-storage, 10ops-codfw, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B7 from asw-b7-codfw to lsw1-b7-codfw - https://phabricator.wikimedia.org/T355872#9587603 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ab1a9b14-3187-4d52-a4f6-be3c445a8081... [16:01:56] (03CR) 10Slyngshede: [C: 03+1] "Make more sense." [puppet] - 10https://gerrit.wikimedia.org/r/1007258 (owner: 10Muehlenhoff) [16:02:16] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 9 hosts with reason: Migrating servers in codfw rack B7 to lsw1-b7-codfw [16:02:19] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1007259 (https://phabricator.wikimedia.org/T357749) (owner: 10Muehlenhoff) [16:02:36] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 9 hosts with reason: Migrating servers in codfw rack B7 to lsw1-b7-codfw [16:02:42] 06SRE, 10SRE-swift-storage, 10ops-codfw, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B7 from asw-b7-codfw to lsw1-b7-codfw - https://phabricator.wikimedia.org/T355872#9587615 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=12cd3c2a-9d8e-4ba6-a42e-1faa167de80d... 
[16:03:02] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:04:00] (03PS1) 10Brouberol: superset: scrape gunicorn metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007642 (https://phabricator.wikimedia.org/T358778) [16:04:30] (03PS1) 10Lucas Werkmeister (WMDE): termbox: update to 2024-02-23-144805-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007643 (https://phabricator.wikimedia.org/T357404) [16:04:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 25%: Excersie over', diff saved to https://phabricator.wikimedia.org/P58259 and previous config saved to /var/cache/conftool/dbconfig/20240229-160449-root.json [16:05:06] (03CR) 10Elukey: [C: 04-1] "I don't particularly love the idea of having the python venv in the final image of the storage container, will try to find something bette" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1007400 (https://phabricator.wikimedia.org/T337213) (owner: 10Elukey) [16:05:22] !log Commencing network maintenance migrating servers to new switch codfw rack B7 T355872 [16:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:29] (03PS3) 10Brouberol: superset: include all configuration checksums into the Deployment annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007618 (https://phabricator.wikimedia.org/T352166) [16:05:31] (03PS2) 10Brouberol: superset: scrape gunicorn metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007642 (https://phabricator.wikimedia.org/T358778) [16:05:31] T355872: Migrate servers in codfw rack B7 from asw-b7-codfw to lsw1-b7-codfw - https://phabricator.wikimedia.org/T355872 [16:06:11] (03CR) 10Brouberol: [C: 03+1] airflow/search: 
Pass firewall range in profile::firewall syntax [puppet] - 10https://gerrit.wikimedia.org/r/1007574 (owner: 10Muehlenhoff) [16:06:43] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [16:06:57] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [16:08:21] (03CR) 10Effie Mouzeli: [C: 03+2] mw-mcrouter: add service stanza [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007641 (owner: 10Effie Mouzeli) [16:09:00] 06SRE, 10LDAP-Access-Requests: Grant Access to Superset for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T358091#9587653 (10KFrancis) Hello all, the NDA has been signed, thanks! [16:09:24] (03Merged) 10jenkins-bot: mw-mcrouter: add service stanza [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007641 (owner: 10Effie Mouzeli) [16:09:53] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/mw-mcrouter: apply [16:10:04] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-mcrouter: apply [16:10:45] 06SRE, 10SRE-swift-storage, 10ops-codfw, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B7 from asw-b7-codfw to lsw1-b7-codfw - https://phabricator.wikimedia.org/T355872#9587654 (10cmooney) All hosts moved successfully. Showing up on switch, macs learnt and all responding to ping a... 
[16:12:44] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_codfw [16:12:46] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_codfw [16:13:02] (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:16:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2156 (re)pooling @ 100%: After recloning', diff saved to https://phabricator.wikimedia.org/P58260 and previous config saved to /var/cache/conftool/dbconfig/20240229-161629-root.json [16:19:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 75%: Excersie over', diff saved to https://phabricator.wikimedia.org/P58261 and previous config saved to /var/cache/conftool/dbconfig/20240229-161954-root.json [16:20:22] (03PS1) 10Effie Mouzeli: mw-mcrouter: enable public service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007644 [16:20:26] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.5 point update - https://phabricator.wikimedia.org/T357133#9587681 (10MoritzMuehlenhoff) [16:20:48] 06SRE, 10SRE-swift-storage, 10ops-codfw, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B7 from asw-b7-codfw to lsw1-b7-codfw - https://phabricator.wikimedia.org/T355872#9587682 (10MatthewVernon) thanos and ms swift clusters OK post-move, thank you! 
[16:21:01] PROBLEM - Hadoop NodeManager on an-worker1174 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:23:10] (03CR) 10Effie Mouzeli: [C: 03+2] mw-mcrouter: enable public service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007644 (owner: 10Effie Mouzeli) [16:23:55] (03Merged) 10jenkins-bot: mw-mcrouter: enable public service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007644 (owner: 10Effie Mouzeli) [16:24:15] 06SRE, 06serviceops: Memcached, mcrouter in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711#9587697 (10jijiki) [16:25:01] RECOVERY - Hadoop NodeManager on an-worker1174 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:26:50] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/mw-mcrouter: apply [16:27:12] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-mcrouter: apply [16:31:04] (03PS1) 10Effie Mouzeli: mw-mcrouter: adjust staging resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007646 [16:31:27] (03PS2) 10Effie Mouzeli: mw-mcrouter: adjust staging resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007646 [16:31:29] (03PS3) 10Dzahn: contint: create ci_test role for zuul-only and apply on contint1003 [puppet] - 10https://gerrit.wikimedia.org/r/1007434 (https://phabricator.wikimedia.org/T358237) [16:34:05] (03CR) 10Effie Mouzeli: [C: 03+2] mw-mcrouter: adjust staging resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007646 (owner: 10Effie Mouzeli) [16:35:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 100%: Excersie over', diff saved to 
https://phabricator.wikimedia.org/P58262 and previous config saved to /var/cache/conftool/dbconfig/20240229-163459-root.json [16:35:06] (03Merged) 10jenkins-bot: mw-mcrouter: adjust staging resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007646 (owner: 10Effie Mouzeli) [16:36:43] (03Abandoned) 10Dzahn: delete grafana password classes [labs/private] - 10https://gerrit.wikimedia.org/r/1007011 (owner: 10Dzahn) [16:37:32] (03CR) 10Cathal Mooney: Remove cloud_private_v4_set from cloudgw nftables definition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/999004 (https://phabricator.wikimedia.org/T356850) (owner: 10Cathal Mooney) [16:37:50] (03CR) 10Btullis: [C: 03+1] "Nice." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007642 (https://phabricator.wikimedia.org/T358778) (owner: 10Brouberol) [16:38:21] (03CR) 10Btullis: [C: 03+1] "Excellent, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007618 (https://phabricator.wikimedia.org/T352166) (owner: 10Brouberol) [16:38:44] (03CR) 10Brouberol: [C: 03+2] superset: include all configuration checksums into the Deployment annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007618 (https://phabricator.wikimedia.org/T352166) (owner: 10Brouberol) [16:38:51] (03CR) 10Brouberol: [C: 03+2] superset: scrape gunicorn metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007642 (https://phabricator.wikimedia.org/T358778) (owner: 10Brouberol) [16:38:59] (03PS3) 10Brouberol: superset: scrape gunicorn metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007642 (https://phabricator.wikimedia.org/T358778) [16:39:01] (03CR) 10CI reject: [V: 04-1] superset: scrape gunicorn metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007642 (https://phabricator.wikimedia.org/T358778) (owner: 10Brouberol) [16:40:27] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/mw-mcrouter: apply [16:40:38] !log jiji@deploy2002 helmfile [staging] DONE 
helmfile.d/services/mw-mcrouter: apply [16:41:07] (03PS4) 10Brouberol: superset: scrape gunicorn metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007642 (https://phabricator.wikimedia.org/T358778) [16:41:20] (03CR) 10CI reject: [V: 04-1] superset: scrape gunicorn metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007642 (https://phabricator.wikimedia.org/T358778) (owner: 10Brouberol) [16:41:34] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [16:42:06] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [16:42:08] (03PS5) 10Brouberol: superset: scrape gunicorn metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007642 (https://phabricator.wikimedia.org/T358778) [16:42:16] (03CR) 10Btullis: superset: include all configuration checksums into the Deployment annotations (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007618 (https://phabricator.wikimedia.org/T352166) (owner: 10Brouberol) [16:43:03] (03PS2) 10Btullis: Change the default systemd timer email source to noreply@wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1007576 (https://phabricator.wikimedia.org/T358675) [16:43:05] (03PS2) 10Btullis: Allow systemd::timer::job to send from a custom address [puppet] - 10https://gerrit.wikimedia.org/r/1007577 (https://phabricator.wikimedia.org/T358675) [16:43:07] (03PS2) 10Btullis: Allow kerberos::systemd::timer to use a custom email sender [puppet] - 10https://gerrit.wikimedia.org/r/1007578 (https://phabricator.wikimedia.org/T358675) [16:43:26] (03CR) 10Brouberol: [C: 03+2] superset: scrape gunicorn metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007642 (https://phabricator.wikimedia.org/T358778) (owner: 10Brouberol) [16:43:40] (CirrusSearchNodeIndexingNotIncreasing) resolved: Elasticsearch instance elastic2043-production-search-codfw is not indexing - 
https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [16:44:47] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [16:44:52] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [16:45:21] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [16:45:51] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [16:45:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply [16:46:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply [16:47:17] (03CR) 10CI reject: [V: 04-1] Change the default systemd timer email source to noreply@wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1007576 (https://phabricator.wikimedia.org/T358675) (owner: 10Btullis) [16:48:25] 06SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9587883 (10ayounsi) @abran-wmf this host needs a quick downtime to replace the SFP. Please sync up with Jenn. 
https://librenms.wikimedia.org/graphs/to=1709224800/id=29015/type=port_errors/from=1706546400/ [16:48:30] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:53:12] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-worker1173.eqiad.wmnet with reason: Investigating disk errors [16:53:26] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-worker1173.eqiad.wmnet with reason: Investigating disk errors [16:53:52] (03PS6) 10Elukey: kserve: upgrade all images to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1007400 (https://phabricator.wikimedia.org/T337213) [16:55:13] (03CR) 10Herron: [C: 03+1] icinga: Remove stale nagios_group entries [puppet] - 10https://gerrit.wikimedia.org/r/1007467 (https://phabricator.wikimedia.org/T358540) (owner: 10Andrea Denisse) [16:58:52] (03CR) 10Dzahn: [C: 03+1] icinga: Remove stale nagios_group entries [puppet] - 10https://gerrit.wikimedia.org/r/1007467 (https://phabricator.wikimedia.org/T358540) (owner: 10Andrea Denisse) [17:00:05] jhathaway and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240229T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:03:40] 06SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9587910 (10ABran-WMF) This is M1 main server for codfw, checking how we'll handle it and will keep you posted [17:10:04] (ProbeDown) firing: (3) Service vrts1002:1443 has failed probes (http_ticket_test_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:11:30] (03PS3) 10Btullis: Change the default systemd timer email source to noreply@wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1007576 (https://phabricator.wikimedia.org/T358675) [17:11:33] (03PS3) 10Btullis: Allow systemd::timer::job to send from a custom address [puppet] - 10https://gerrit.wikimedia.org/r/1007577 (https://phabricator.wikimedia.org/T358675) [17:11:35] (03PS3) 10Btullis: Allow kerberos::systemd::timer to use a custom email sender [puppet] - 10https://gerrit.wikimedia.org/r/1007578 (https://phabricator.wikimedia.org/T358675) [17:12:12] 06SRE, 10ops-eqiad, 06DC-Ops: hw move: GPU from stat1005 to stat1010 - https://phabricator.wikimedia.org/T358763#9587975 (10Jclark-ctr) @BTullis would you like to do before or after sre summit? [17:14:35] 06SRE, 10ops-eqiad, 06DC-Ops: hw move: GPU from stat1005 to stat1010 - https://phabricator.wikimedia.org/T358763#9587986 (10BTullis) >>! In T358763#9587968, @Jclark-ctr wrote: > @BTullis would you like to do before or after sre summit? Ideally before, please. I think we can do it at very short notice to us... 
[17:16:40] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T358787 (10phaultfinder) [17:20:04] (ProbeDown) resolved: (3) Service vrts1002:1443 has failed probes (http_ticket_test_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:28:14] (03PS1) 10Brouberol: superset: fix, add missing gunicorn statsd export config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007650 (https://phabricator.wikimedia.org/T358778) [17:32:58] (03PS23) 10Volans: sre.hosts.move-vlan: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [17:33:01] (03PS1) 10Volans: sre.dns.wipe-cache: allow to wipe multiple domains [cookbooks] - 10https://gerrit.wikimedia.org/r/1007651 [17:33:03] (03PS1) 10Volans: sre.hosts.reimage: add support for VLAN move [cookbooks] - 10https://gerrit.wikimedia.org/r/1007652 (https://phabricator.wikimedia.org/T350152) [17:34:35] (03Abandoned) 10Cathal Mooney: Move lvs2012 from private1-b-codfw to private1-b2-codfw vlan [puppet] - 10https://gerrit.wikimedia.org/r/980948 (https://phabricator.wikimedia.org/T352918) (owner: 10Cathal Mooney) [17:34:41] (03PS1) 10Bking: wdqs: Distinguish between public and internal monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) [17:35:33] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [17:35:58] (03CR) 10CI reject: [V: 04-1] wdqs: Distinguish between public and internal monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [17:37:07] (03PS2) 10Bking: wdqs: Distinguish between public and internal monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1007653 
(https://phabricator.wikimedia.org/T358029) [17:37:12] (03CR) 10Volans: "I've sent PS23 with my proposal to invert the flux and call it from the reimage to automatically and transparently support any reimage fla" [cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [17:37:31] (03CR) 10Volans: "This would be the bit in the reimage cookbook to call it." [cookbooks] - 10https://gerrit.wikimedia.org/r/1007652 (https://phabricator.wikimedia.org/T350152) (owner: 10Volans) [17:37:45] 10SRE-swift-storage, 10ops-codfw, 10ops-eqiad, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9588119 (10ssingh) Thanks for sharing @ayounsi! [17:38:18] (03CR) 10CI reject: [V: 04-1] wdqs: Distinguish between public and internal monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [17:41:44] (03CR) 10BBlack: [C: 03+1] P:dns::auth: add support for depooling services via confd/confctl [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [17:42:54] (03PS3) 10Bking: wdqs: Distinguish between public and internal monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) [17:43:15] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [17:44:38] (03CR) 10CI reject: [V: 04-1] wdqs: Distinguish between public and internal monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [17:46:48] (03PS2) 10Ayounsi: [WIP] netbox: add functions to get and set device name [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 [17:48:00] (03PS4) 10Bking: wdqs: Distinguish between public and internal monitoring [puppet] - 
10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) [17:48:24] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [17:48:31] (03CR) 10Andrea Denisse: [C: 03+2] icinga: Remove stale nagios_group entries [puppet] - 10https://gerrit.wikimedia.org/r/1007467 (https://phabricator.wikimedia.org/T358540) (owner: 10Andrea Denisse) [17:48:54] (03CR) 10Ayounsi: [C: 03+1] sre.dns.wipe-cache: allow to wipe multiple domains [cookbooks] - 10https://gerrit.wikimedia.org/r/1007651 (owner: 10Volans) [17:49:22] (03CR) 10Ssingh: [C: 03+1] "Looks good, very helpful change thanks! (PS: I think you will remember but we will need to update https://wikitech.wikimedia.org/wiki/DNS#" [cookbooks] - 10https://gerrit.wikimedia.org/r/1007651 (owner: 10Volans) [17:49:53] (03CR) 10Ayounsi: [WIP] netbox: add functions to get and set device name (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1007614 (owner: 10Ayounsi) [17:51:33] (03PS4) 10Slyngshede: C:tomcat10 Sync server.xml with default from Tomcat10/Bookworm. 
[puppet] - 10https://gerrit.wikimedia.org/r/1007596 (https://phabricator.wikimedia.org/T357748) [17:51:54] (03CR) 10CI reject: [V: 04-1] wdqs: Distinguish between public and internal monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [17:52:06] (03CR) 10Volans: [C: 03+2] "Sure will do, thanks for the reminder" [cookbooks] - 10https://gerrit.wikimedia.org/r/1007651 (owner: 10Volans) [17:52:19] (03PS5) 10Bking: wdqs: Distinguish between public and internal monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) [17:53:45] (03PS1) 10Ssingh: depool codfw: (if required) maintenance work in codfw [dns] - 10https://gerrit.wikimedia.org/r/1007656 [17:53:51] (03CR) 10Slyngshede: "Major Tomcat upgrades are pretty infrequent, so anything to make our lives a little easier in the future is a plus." [puppet] - 10https://gerrit.wikimedia.org/r/1007596 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [17:55:05] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10netops, 13Patch-For-Review: Move lvs2012 from private1-b-codfw (row) to private1-b2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352918#9588268 (10cmooney) Ok I abandoned my previous change as my git skills weren't up to it. Prepping a new patc... 
[17:55:11] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [17:55:15] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1549/console" [puppet] - 10https://gerrit.wikimedia.org/r/1007596 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [17:56:09] (03CR) 10CI reject: [V: 04-1] wdqs: Distinguish between public and internal monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [17:56:30] (03Merged) 10jenkins-bot: sre.dns.wipe-cache: allow to wipe multiple domains [cookbooks] - 10https://gerrit.wikimedia.org/r/1007651 (owner: 10Volans) [17:57:45] (03PS6) 10Bking: wdqs: Distinguish between public and internal monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) [17:58:13] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [18:00:05] bd808: That opportune time for a Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240229T1800). 
[18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240229T1800) [18:01:24] looks like I should do a developer-portal deploy in the window this week [18:01:25] (03PS3) 10Slyngshede: Load average for Swift cluster [alerts] - 10https://gerrit.wikimedia.org/r/1004619 (https://phabricator.wikimedia.org/T350694) [18:03:14] (03CR) 10Slyngshede: Load average for Swift cluster (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1004619 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [18:04:05] (03PS1) 10BryanDavis: developer-portal: Bump container to 2024-02-29-122131-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007659 [18:05:55] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache sretest1001.eqiad.wmnet sretest1002.eqiad.wmnet on all recursors [18:05:58] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) sretest1001.eqiad.wmnet sretest1002.eqiad.wmnet on all recursors [18:06:17] (03Abandoned) 10Slyngshede: C:prometheus::node_exporter allow CPU flags collection [puppet] - 10https://gerrit.wikimedia.org/r/979902 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [18:06:39] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container to 2024-02-29-122131-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007659 (owner: 10BryanDavis) [18:07:25] (03Abandoned) 10Slyngshede: Review access change [software/debmonitor-client] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/984171 (owner: 10Slyngshede) [18:07:28] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2024-02-29-122131-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007659 (owner: 10BryanDavis) [18:10:00] !log volans@cumin1002 START - Cookbook sre.dns.wipe-cache 'sretest1001.eqiad.wmnet$' on ulsfo recursors [18:10:02] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 
'sretest1001.eqiad.wmnet$' on ulsfo recursors [18:11:25] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [18:11:27] (03PS7) 10Bking: wdqs: Distinguish between public and internal monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) [18:11:44] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [18:12:04] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [18:12:04] (03PS1) 10Cathal Mooney: Move lvs2012 from private1-b-codfw to private1-b2-codfw vlan [puppet] - 10https://gerrit.wikimedia.org/r/1007660 (https://phabricator.wikimedia.org/T352918) [18:12:28] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [18:12:36] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [18:12:50] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [18:12:57] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2012.codfw.wmnet with reason: Moving lvs2012 primary interface from private1-b-codfw to private1-b2-codfw [18:12:58] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [18:13:11] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2012.codfw.wmnet with reason: Moving lvs2012 primary interface from private1-b-codfw to private1-b2-codfw [18:13:21] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10netops, 13Patch-For-Review: Move lvs2012 from private1-b-codfw (row) to private1-b2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352918#9588323 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2521164e-4d59-47cb-8d79-8dd925725... 
[18:18:05] * bd808 is {{done}} with his deploy window [18:20:08] (03CR) 10Ayounsi: [C: 03+1] sre.hosts.reimage: add support for VLAN move [cookbooks] - 10https://gerrit.wikimedia.org/r/1007652 (https://phabricator.wikimedia.org/T350152) (owner: 10Volans) [18:20:18] (03CR) 10Ayounsi: [C: 03+1] sre.hosts.move-vlan: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [18:20:34] (03CR) 10Ssingh: [C: 03+1] "Looks good! Don't think we are missing any other config, checked tagged::subnets as well." [puppet] - 10https://gerrit.wikimedia.org/r/1007660 (https://phabricator.wikimedia.org/T352918) (owner: 10Cathal Mooney) [18:28:41] (03PS4) 10Btullis: Change the default systemd timer email source to noreply@wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1007576 (https://phabricator.wikimedia.org/T358675) [18:28:43] (03PS4) 10Btullis: Allow systemd::timer::job to send from a custom address [puppet] - 10https://gerrit.wikimedia.org/r/1007577 (https://phabricator.wikimedia.org/T358675) [18:28:45] (03PS4) 10Btullis: Allow kerberos::systemd::timer to use a custom email sender [puppet] - 10https://gerrit.wikimedia.org/r/1007578 (https://phabricator.wikimedia.org/T358675) [18:29:10] (03CR) 10Volans: wdqs: Distinguish between public and internal monitoring (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [18:30:47] 06SRE, 10FY2023/2024-Q3, 13Patch-For-Review: Icinga Fails to Start Due to Missing Hostgroup 'swift' - https://phabricator.wikimedia.org/T358540#9588371 (10andrea.denisse) 05Open→03Resolved [18:30:55] 06SRE, 10FY2023/2024-Q3, 13Patch-For-Review: Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615#9588372 (10andrea.denisse) [18:31:46] (03CR) 10Ryan Kemper: "Minor nits. 
Will do more thorough review later today" [puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [18:36:09] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:dns::auth: add support for depooling services via confd/confctl [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [18:37:15] !log disabling PyBal on lvs2012 to move traffic to lvs2014 ahead of reimage T352918 [18:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:21] T352918: Move lvs2012 from private1-b-codfw (row) to private1-b2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352918 [18:38:39] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:38:49] ^ expected [18:39:30] ugh I should have downtimed the CRs too [18:40:14] thanks sukhe! [18:40:17] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cr[1-2]-codfw with reason: lvs moves to per-rack vlans [18:40:33] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr[1-2]-codfw with reason: lvs moves to per-rack vlans [18:40:42] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10netops, 13Patch-For-Review: Move lvs2012 from private1-b-codfw (row) to private1-b2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352918#9588429 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1baa9b0f-d917-4da6-83db-5cb28b50c... 
[18:47:41] (ConfdResourceFailed) firing: (3) confd resource _var_lib_dnsbox_authdns_ns2.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [18:48:25] yes ^ looking and fixing [18:48:30] (ProbeDown) firing: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox:4008 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:49:50] (03PS1) 10Ssingh: P:dns::auth::update: explicitly create service state directory [puppet] - 10https://gerrit.wikimedia.org/r/1007663 [18:51:04] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1552/co" [puppet] - 10https://gerrit.wikimedia.org/r/1007663 (owner: 10Ssingh) [18:53:02] (ProbeDown) resolved: Service shellbox:4008 has failed probes (http_shellbox_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#shellbox:4008 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:53:54] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:dns::auth::update: explicitly create service state directory [puppet] - 10https://gerrit.wikimedia.org/r/1007663 (owner: 10Ssingh) [18:54:23] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:56:32] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10netops, 13Patch-For-Review: Move lvs2012 from private1-b-codfw (row) to private1-b2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352918#9588508 (10cmooney) [18:57:31] 06SRE, 06Infrastructure-Foundations, 06Traffic, 
10netops, 13Patch-For-Review: Move lvs2012 from private1-b-codfw (row) to private1-b2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352918#9588511 (10cmooney) [18:57:38] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns6001.wikimedia.org,service=ntp [18:58:18] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns6001.wikimedia.org,service=recdns [18:59:55] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns6001.wikimedia.org,service=(recdns|ntp) [19:00:05] dduvall and jeena: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240229T1900). [19:02:41] (ConfdResourceFailed) resolved: (3) confd resource _var_lib_dnsbox_authdns_ns2.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:03:46] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [19:05:35] PROBLEM - BGP status on lsw1-b2-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:06:05] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for lvs2012 - cmooney@cumin1002" [19:06:53] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache lvs2012.codfw.wmnet on all recursors [19:06:56] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) lvs2012.codfw.wmnet on all recursors [19:06:57] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for lvs2012 - cmooney@cumin1002" [19:06:57] !log cmooney@cumin1002 END (PASS) 
- Cookbook sre.dns.netbox (exit_code=0) [19:09:23] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:10:03] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10netops, 13Patch-For-Review: Move lvs2012 from private1-b-codfw (row) to private1-b2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352918#9588555 (10cmooney) [19:10:37] RECOVERY - BGP status on lsw1-b2-codfw.mgmt is OK: BGP OK - up: 2, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:10:42] (03CR) 10Cathal Mooney: [C: 03+2] Move lvs2012 from private1-b-codfw to private1-b2-codfw vlan [puppet] - 10https://gerrit.wikimedia.org/r/1007660 (https://phabricator.wikimedia.org/T352918) (owner: 10Cathal Mooney) [19:10:45] 06SRE, 10ops-eqiad: PowerSupplyFailure - an-coord1003 - https://phabricator.wikimedia.org/T358787#9588559 (10Dzahn) [19:14:34] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host lvs2012.codfw.wmnet with OS bullseye [19:14:34] (03CR) 10Bking: wdqs: Distinguish between public and internal monitoring (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [19:14:43] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10netops, 13Patch-For-Review: Move lvs2012 from private1-b-codfw (row) to private1-b2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352918#9588573 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host... 
[19:14:51] (03PS8) 10Bking: wdqs: Distinguish between public and internal monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) [19:15:50] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [19:20:55] (03PS2) 10Dzahn: site: remove etherpad on bullseye machine [puppet] - 10https://gerrit.wikimedia.org/r/1003075 (https://phabricator.wikimedia.org/T316421) [19:22:00] (03CR) 10Dzahn: "to be merged as the last step after decom cookbook" [puppet] - 10https://gerrit.wikimedia.org/r/1003075 (https://phabricator.wikimedia.org/T316421) (owner: 10Dzahn) [19:22:43] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10netops, 13Patch-For-Review: Move lvs2012 from private1-b-codfw (row) to private1-b2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352918#9588615 (10cmooney) [19:23:03] (03CR) 10Bking: wdqs: Distinguish between public and internal monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1007653 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [19:24:49] (03PS3) 10Dzahn: phabricator: setup scap bin link in migration class [puppet] - 10https://gerrit.wikimedia.org/r/1006974 (https://phabricator.wikimedia.org/T357572) [19:25:15] (03PS1) 10TrainBranchBot: group2 wikis to 1.42.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007691 (https://phabricator.wikimedia.org/T354438) [19:25:17] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.42.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007691 (https://phabricator.wikimedia.org/T354438) (owner: 10TrainBranchBot) [19:26:10] (03Merged) 10jenkins-bot: group2 wikis to 1.42.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007691 (https://phabricator.wikimedia.org/T354438) (owner: 10TrainBranchBot) [19:26:15] ok, train time. 
here comes the choo choo [19:29:32] (03PS2) 10Mabualruz: Performance Impact Assessment for Night Mode Style Correction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007352 (https://phabricator.wikimedia.org/T358240) [19:31:16] (03CR) 10Mabualruz: Performance Impact Assessment for Night Mode Style Correction (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007352 (https://phabricator.wikimedia.org/T358240) (owner: 10Mabualruz) [19:32:43] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2012.codfw.wmnet with reason: host reimage [19:35:30] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2012.codfw.wmnet with reason: host reimage [19:35:34] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.20 refs T354438 [19:35:51] T354438: 1.42.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T354438 [19:36:16] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10FY2023/2024-Q3-Q4, 13Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9588669 (10RKemper) a:05fnegri→03RKemper @fnegri @brouberol Yeah, Brian and I will work on... [19:36:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2190 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P58264 and previous config saved to /var/cache/conftool/dbconfig/20240229-193643-root.json [19:46:18] (03PS3) 10Andrea Denisse: icinga: Set log group to 'nagios' to resolve permission conflicts [puppet] - 10https://gerrit.wikimedia.org/r/1007470 (https://phabricator.wikimedia.org/T358539) [19:46:51] (03CR) 10Andrea Denisse: "Good observation, I've sent a new path. Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1007470 (https://phabricator.wikimedia.org/T358539) (owner: 10Andrea Denisse) [19:47:45] (03CR) 10CI reject: [V: 04-1] icinga: Set log group to 'nagios' to resolve permission conflicts [puppet] - 10https://gerrit.wikimedia.org/r/1007470 (https://phabricator.wikimedia.org/T358539) (owner: 10Andrea Denisse) [19:48:52] (03PS4) 10Andrea Denisse: icinga: Set log group to 'nagios' to resolve permission conflicts [puppet] - 10https://gerrit.wikimedia.org/r/1007470 (https://phabricator.wikimedia.org/T358539) [19:51:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2190 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P58265 and previous config saved to /var/cache/conftool/dbconfig/20240229-195148-root.json [19:57:23] 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06serviceops, 10Event-Platform: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058#9588704 (10Ottomata) +1, or add this as a subtask of that? Either good with me! 
[19:58:23] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in cloudelastic [19:58:24] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in cloudelastic [20:00:46] (03Abandoned) 10Cathal Mooney: Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan [puppet] - 10https://gerrit.wikimedia.org/r/980954 (https://phabricator.wikimedia.org/T352920) (owner: 10Cathal Mooney) [20:05:58] (03CR) 10Jdlrobson: [C: 03+1] Performance Impact Assessment for Night Mode Style Correction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007352 (https://phabricator.wikimedia.org/T358240) (owner: 10Mabualruz) [20:06:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2190 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P58266 and previous config saved to /var/cache/conftool/dbconfig/20240229-200653-root.json [20:09:00] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10netops, 13Patch-For-Review: Move lvs2011 from private1-a-codfw (row) to private1-a2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352920#9588735 (10cmooney) [20:17:05] (03CR) 10Dzahn: "in the latest PS I am just adding it to random node to be able to show in compiler it works while we dont have an actual host with the mig" [puppet] - 10https://gerrit.wikimedia.org/r/1006974 (https://phabricator.wikimedia.org/T357572) (owner: 10Dzahn) [20:20:21] (03CR) 10Dzahn: [C: 04-1] "and ""Scap::user" is not a valid resource reference"" [puppet] - 10https://gerrit.wikimedia.org/r/1006974 (https://phabricator.wikimedia.org/T357572) (owner: 10Dzahn) [20:21:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2190 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P58267 and previous config saved to /var/cache/conftool/dbconfig/20240229-202158-root.json [20:22:19] (03PS1) 10Ssingh: Revert "pybal: do not install from component" [puppet] - 
10https://gerrit.wikimedia.org/r/1007382 [20:23:48] (03PS4) 10Dzahn: phabricator: setup scap bin link in migration class [puppet] - 10https://gerrit.wikimedia.org/r/1006974 (https://phabricator.wikimedia.org/T357572) [20:25:31] (03Abandoned) 10Ssingh: Revert "pybal: do not install from component" [puppet] - 10https://gerrit.wikimedia.org/r/1007382 (owner: 10Ssingh) [20:25:46] (03PS2) 10RLazarus: mediawiki: Restrict /wiki RewriteRule [puppet] - 10https://gerrit.wikimedia.org/r/1007026 (https://phabricator.wikimedia.org/T357595) [20:26:23] (03PS3) 10RLazarus: mediawiki: Restrict /wiki RewriteRule [puppet] - 10https://gerrit.wikimedia.org/r/1007026 (https://phabricator.wikimedia.org/T357595) [20:37:40] (03CR) 10Mabualruz: Performance Impact Assessment for Night Mode Style Correction (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007352 (https://phabricator.wikimedia.org/T358240) (owner: 10Mabualruz) [20:37:42] (03CR) 10RLazarus: [C: 03+1] "Sorry, I left this reply in drafts -- yes, LGTM." [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973871 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall) [20:37:55] (03CR) 10RLazarus: [C: 03+1] slo_definitions: Use trafficserver_backend_sli_bad [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/973872 (https://phabricator.wikimedia.org/T341606) (owner: 10BCornwall) [20:41:04] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:42:43] 06SRE, 10LDAP-Access-Requests: Grant Access to nda, wmde for Frederik Ring - https://phabricator.wikimedia.org/T358584#9588874 (10Dzahn) ` [mwmaint1002:~] $ ldapsearch -x mail=fred*wikimedia.de .. uidNumber: 43019 .. 
` [20:44:36] PROBLEM - BGP status on lsw1-b2-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:44:58] (03PS1) 10Ahmon Dancy: Switch from git-fat to git-lfs [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/1007701 (https://phabricator.wikimedia.org/T357739) [20:45:36] RECOVERY - BGP status on lsw1-b2-codfw.mgmt is OK: BGP OK - up: 3, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:46:13] (03PS1) 10Dzahn: admin: add Frederik Ring to LDAP_only (wmde,nda) [puppet] - 10https://gerrit.wikimedia.org/r/1007702 (https://phabricator.wikimedia.org/T358584) [20:47:26] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs2012.codfw.wmnet with OS bullseye [20:47:32] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10netops: Move lvs2012 from private1-b-codfw (row) to private1-b2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352918#9588904 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host lvs2012.codfw.wmnet with... 
[20:48:30] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:49:11] (03CR) 10Dzahn: [C: 03+2] admin: add Frederik Ring to LDAP_only (wmde,nda) [puppet] - 10https://gerrit.wikimedia.org/r/1007702 (https://phabricator.wikimedia.org/T358584) (owner: 10Dzahn) [20:51:01] (03CR) 10Dzahn: [C: 03+2] "uidNumber: 43019" [puppet] - 10https://gerrit.wikimedia.org/r/1007702 (https://phabricator.wikimedia.org/T358584) (owner: 10Dzahn) [20:52:33] !log LDAP - added uid frri (43019) to groups nda and wmde (T358584 [20:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:43] T358584: Grant Access to nda, wmde for Frederik Ring - https://phabricator.wikimedia.org/T358584 [20:53:42] (03PS1) 10Cathal Mooney: Move lvs2011 from private1-a-codfw to private1-a2-codfw vlan [puppet] - 10https://gerrit.wikimedia.org/r/1007703 (https://phabricator.wikimedia.org/T352920) [20:54:12] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to nda, wmde for Frederik Ring - https://phabricator.wikimedia.org/T358584#9588938 (10Dzahn) Hi all. This is done. Frederik (frri in LDAP) has been added to the groups nda and wmde. Everything should work like for other WMDE employees. 
[20:55:28] (03PS1) 10Ssingh: pybal: install python-twisted from component/pybal [puppet] - 10https://gerrit.wikimedia.org/r/1007704 [20:56:29] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to nda, wmde for Frederik Ring - https://phabricator.wikimedia.org/T358584#9588939 (10Dzahn) 05Open→03Resolved a:03Dzahn [20:56:42] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1555/co" [puppet] - 10https://gerrit.wikimedia.org/r/1007704 (owner: 10Ssingh) [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240229T2100). [21:00:05] jan_drewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:23] 06SRE, 06Infrastructure-Foundations, 06Traffic, 10netops: Move lvs2012 from private1-b-codfw (row) to private1-b2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352918#9588964 (10cmooney) 05Open→03Resolved Moved to new vlan and BGP established between server and switch now. ` cmooney@lvs2012:/e... [21:00:29] 06SRE, 06Infrastructure-Foundations, 10netops: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869#9588966 (10cmooney) [21:00:48] I guess it's just my patch today. 
I can self deploy [21:01:19] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 320, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:01:58] (03PS1) 10Dzahn: admin: add Ifeatu Nnaobi to LDAP_only (wmde,nda) [puppet] - 10https://gerrit.wikimedia.org/r/1007706 (https://phabricator.wikimedia.org/T358091) [21:03:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007352 (https://phabricator.wikimedia.org/T358240) (owner: 10Mabualruz) [21:03:53] (03Merged) 10jenkins-bot: Performance Impact Assessment for Night Mode Style Correction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007352 (https://phabricator.wikimedia.org/T358240) (owner: 10Mabualruz) [21:03:56] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "uidNumber: 46162" [puppet] - 10https://gerrit.wikimedia.org/r/1007706 (https://phabricator.wikimedia.org/T358091) (owner: 10Dzahn) [21:04:07] !log jdrewniak@deploy2002 Started scap: Backport for [[gerrit:1007352|Performance Impact Assessment for Night Mode Style Correction (T358240)]] [21:04:13] T358240: [Spike 4hours] Performance Impact Assessment for Night Mode Style Correction - https://phabricator.wikimedia.org/T358240 [21:05:37] !log jdrewniak@deploy2002 mabualruz and jdrewniak: Backport for [[gerrit:1007352|Performance Impact Assessment for Night Mode Style Correction (T358240)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:05:49] !log jdrewniak@deploy2002 mabualruz and jdrewniak: Continuing with sync [21:08:39] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007709 (https://phabricator.wikimedia.org/T128546) [21:12:44] @mo_abualruz @jan_drewniak are you still here and able to do another deploy? 
[21:13:36] !log jdrewniak@deploy2002 Finished scap: Backport for [[gerrit:1007352|Performance Impact Assessment for Night Mode Style Correction (T358240)]] (duration: 09m 28s) [21:13:39] (03PS1) 10Jdlrobson: Default to day mode [skins/MinervaNeue] (wmf/1.42.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1007383 (https://phabricator.wikimedia.org/T358811) [21:13:44] T358240: [Spike 4hours] Performance Impact Assessment for Night Mode Style Correction - https://phabricator.wikimedia.org/T358240 [21:14:11] I am here [21:15:26] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/MinervaNeue] (wmf/1.42.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1007383 (https://phabricator.wikimedia.org/T358811) (owner: 10Jdlrobson) [21:16:56] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T358812 (10phaultfinder) [21:23:21] !log LDAP - added uid member: uid=ifeatunnaobiwmde,ou=people,dc=wikimedia,dc=org [21:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:21] !log LDAP - added uid ifeatunnaobiwmde (46162) to groups nda and wmde (T358091) [21:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:30] T358091: Grant Access to Superset for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T358091 [21:25:44] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [21:25:48] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to Superset for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T358091#9589054 (10Dzahn) 05Open→03Resolved ` [mwmaint1002:~] $ ldapsearch -x mail=ifeatu*wikimedia.de .. uidNumber: 46162 ` ✓ --- Hi all, this is done. Ifeatu (if... 
[21:25:57] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance
[21:25:58] (CR) CI reject: [V: -1] Default to day mode [skins/MinervaNeue] (wmf/1.42.0-wmf.20) - https://gerrit.wikimedia.org/r/1007383 (https://phabricator.wikimedia.org/T358811) (owner: Jdlrobson)
[21:26:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2114 (T352010)', diff saved to https://phabricator.wikimedia.org/P58268 and previous config saved to /var/cache/conftool/dbconfig/20240229-212602-ladsgroup.json
[21:26:09] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[21:26:12] SRE, ops-codfw: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9589058 (wiki_willy) If we change the serial number, I think it would create an error for S/N / Asset tag mismatch. (related to Riccardo's points earlier) We also reference the original chassis S...
[21:26:50] SRE, ops-eqiad, Wikidata, Wikidata-Query-Service, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9589062 (dr0ptp4kt)
[21:28:14] SRE, ops-eqiad, Wikidata, Wikidata-Query-Service, and 2 others: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9589065 (dr0ptp4kt)
[21:29:29] (CR) Scott French: [C: +1] "Thanks for fixing this, Reuven." [puppet] - https://gerrit.wikimedia.org/r/1007026 (https://phabricator.wikimedia.org/T357595) (owner: RLazarus)
[21:29:52] !log phabricator - added Ifeatu_Nnaobi_WMDE to WMF-NDA (group 61) - T358578
[21:29:56] (PS2) Jdlrobson: Default to day mode [skins/MinervaNeue] (wmf/1.42.0-wmf.20) - https://gerrit.wikimedia.org/r/1007383 (https://phabricator.wikimedia.org/T358811)
[21:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:58] T358578: Add WMDE staff who have signed the NDA with the WMF to the WMF-NDA phabricator policy group - https://phabricator.wikimedia.org/T358578
[21:30:03] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/1007709 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[21:30:57] (PS3) Jdlrobson: Default to day mode [skins/MinervaNeue] (wmf/1.42.0-wmf.20) - https://gerrit.wikimedia.org/r/1007383 (https://phabricator.wikimedia.org/T358811)
[21:31:07] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/1007709 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[21:31:26] (CR) TrainBranchBot: [C: +2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/MinervaNeue] (wmf/1.42.0-wmf.20) - https://gerrit.wikimedia.org/r/1007383 (https://phabricator.wikimedia.org/T358811) (owner: Jdlrobson)
[21:31:54] !log phabricator - added Fring to WMF-NDA (group 61) - T358578
[21:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:05] (CR) RLazarus: mediawiki: Restrict /wiki RewriteRule (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1007026 (https://phabricator.wikimedia.org/T357595) (owner: RLazarus)
[21:41:31] (CR) Muehlenhoff: "Why is the component/pybal no longer included? When I made the original backport pybal itself (back then in version 1.15.13) and all it's " [puppet] - https://gerrit.wikimedia.org/r/1007704 (owner: Ssingh)
[21:42:06] !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1007709| Bumping portals to master (T128546)]] (duration: 08m 40s)
[21:42:12] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[21:48:58] PROBLEM - Hadoop NodeManager on an-worker1165 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:50:29] !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:1007709| Bumping portals to master (T128546)]] (duration: 08m 23s)
[21:50:36] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[21:51:15] (Merged) jenkins-bot: Default to day mode [skins/MinervaNeue] (wmf/1.42.0-wmf.20) - https://gerrit.wikimedia.org/r/1007383 (https://phabricator.wikimedia.org/T358811) (owner: Jdlrobson)
[21:51:28] !log jdrewniak@deploy2002 Started scap: Backport for [[gerrit:1007383|Default to day mode (T358811)]]
[21:51:38] T358811: Set Minerva default to day mode - https://phabricator.wikimedia.org/T358811
[21:52:51] !log jdrewniak@deploy2002 jdlrobson and jdrewniak: Backport for [[gerrit:1007383|Default to day mode (T358811)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:52:52] (CR) Muehlenhoff: "Oh, I had forgotten about https://gerrit.wikimedia.org/r/c/operations/puppet/+/976312, that patch was actually wrong. The original version" [puppet] - https://gerrit.wikimedia.org/r/1007704 (owner: Ssingh)
[21:53:57] SRE, LDAP-Access-Requests: Grant Access to Superset for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T358091#9589134 (Dzahn) @Ifeatu_Nnaobi_WMDE You have also been added to the WMF-NDA group in Phabricator and can now see restricted tickets.
[21:54:09] !log jdrewniak@deploy2002 jdlrobson and jdrewniak: Continuing with sync
[21:57:04] RECOVERY - Hadoop NodeManager on an-worker1165 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:57:27] !log phabricator - added STran to WMF-NDA (group 61) - T355388
[21:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:57:33] T355388: Request to add STran to WMF-NDA group - https://phabricator.wikimedia.org/T355388
[22:02:09] !log jdrewniak@deploy2002 Finished scap: Backport for [[gerrit:1007383|Default to day mode (T358811)]] (duration: 10m 40s)
[22:02:17] T358811: Set Minerva default to day mode - https://phabricator.wikimedia.org/T358811
[22:05:06] (CR) Scott French: [C: +1] mediawiki: Restrict /wiki RewriteRule (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1007026 (https://phabricator.wikimedia.org/T357595) (owner: RLazarus)
[22:06:17] (PS2) Dzahn: delete passwords::mysql::wikimania_scholarships and passwords::tor [labs/private] - https://gerrit.wikimedia.org/r/1007010
[22:08:42] (PS3) Dzahn: delete passwords for wikimania_scholarships, tor, private_static_site [labs/private] - https://gerrit.wikimedia.org/r/1007010
[22:08:48] (CR) Dzahn: [V: +1] "All of these are historic and don't exist anymore in the private repo." [labs/private] - https://gerrit.wikimedia.org/r/1007010 (owner: Dzahn)
[22:09:34] (CR) Dzahn: [V: +2 C: +2] delete passwords for wikimania_scholarships, tor, private_static_site [labs/private] - https://gerrit.wikimedia.org/r/1007010 (owner: Dzahn)
[22:22:24] (CR) Dzahn: "the linked ticket is closed since 2021 but this was never merged" [labs/private] - https://gerrit.wikimedia.org/r/739586 (https://phabricator.wikimedia.org/T282787) (owner: BBlack)
[22:24:03] (CR) Dzahn: "the linked ticket was closed as declined in 2022 but this is still open" [labs/private] - https://gerrit.wikimedia.org/r/672451 (https://phabricator.wikimedia.org/T277483) (owner: Dave Pifke)
[22:31:15] (CR) Dzahn: "open since 2017 but meanwhile added in 2020/2021 - https://gerrit.wikimedia.org/r/c/labs/private/+/572918/1/hieradata/labs.yaml" [labs/private] - https://gerrit.wikimedia.org/r/340148 (owner: Tim Landscheidt)
[22:37:36] !log removing 4 files for legal compliance
[22:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:31:05] (PS1) Scott French: Improve support for mirroring the full keyspace [software/etcd-mirror] - https://gerrit.wikimedia.org/r/1007738
[23:39:09] (PS1) RLazarus: admin: Flatten $user_list before joining [puppet] - https://gerrit.wikimedia.org/r/1007739 (https://phabricator.wikimedia.org/T358361)
[23:39:11] (PS1) RLazarus: Revert "admin: Remove *sre_admins_members from datacenter-ops" [puppet] - https://gerrit.wikimedia.org/r/1007740 (https://phabricator.wikimedia.org/T358361)
[23:40:54] (CR) RLazarus: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1556/console" [puppet] - https://gerrit.wikimedia.org/r/1007740 (https://phabricator.wikimedia.org/T358361) (owner: RLazarus)
[23:52:43] (PS1) Volans: cumin: fix insetup role report mapping [puppet] - https://gerrit.wikimedia.org/r/1007743
[23:58:44] SRE, ops-codfw: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9589386 (Volans) I don't think there is a clean solution if the iDrac doesn't allow to override the value on the motherboard when done outside of warranty. We could check if there is a way on the...