[00:19:02] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:31:20] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:34:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host rdb1013.eqiad.wmnet with OS bullseye [00:34:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host rdb1013.eqiad.wmnet with OS bullseye [00:38:35] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/940225 [00:38:41] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/940225 (owner: 10TrainBranchBot) [00:55:11] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/940225 (owner: 10TrainBranchBot) [01:03:34] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T342592 (10phaultfinder) [01:16:09] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host rdb1013.eqiad.wmnet with OS bullseye [01:16:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host rdb1013.eqiad.wmnet with OS bullseye executed with errors: - rdb1013 (**FAIL**) - Rem... 
[01:17:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host rdb1014.eqiad.wmnet with OS bullseye [01:17:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host rdb1014.eqiad.wmnet with OS bullseye [01:30:10] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host rdb1014.eqiad.wmnet with OS bullseye [01:30:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host rdb1014.eqiad.wmnet with OS bullseye executed with errors: - rdb1014 (**FAIL**) - Rem... [01:34:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:45:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:49:19] 10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10TomerLerner) Thank you @akosiaris We can only run client requests in the production URL, I guess it'll do for now until we com... 
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T0200) [02:03:48] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: generate_os_reports.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:37] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.41.0-wmf.19 [core] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941047 (https://phabricator.wikimedia.org/T340247) [02:07:43] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.41.0-wmf.19 [core] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941047 (https://phabricator.wikimedia.org/T340247) (owner: 10TrainBranchBot) [02:18:32] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:18:34] (03PS10) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 [02:18:36] (03PS1) 10Andrew Bogott: docker-service-shim.erb: support a list of arbitrary bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/941031 [02:18:58] (03CR) 10CI reject: [V: 04-1] Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (owner: 10Andrew Bogott) [02:19:02] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:22:54] 
(03Merged) 10jenkins-bot: Branch commit for wmf/1.41.0-wmf.19 [core] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941047 (https://phabricator.wikimedia.org/T340247) (owner: 10TrainBranchBot) [02:24:19] (03PS11) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 [02:24:43] (03CR) 10CI reject: [V: 04-1] Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (owner: 10Andrew Bogott) [02:31:08] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:31:59] (03PS12) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 [02:32:22] (03CR) 10jenkins-bot: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (owner: 10Andrew Bogott) [02:32:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:35:51] (03PS13) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 [02:36:15] (03CR) 10CI reject: [V: 04-1] Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (owner: 10Andrew Bogott) [02:38:12] (03PS14) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 [02:38:35] (03CR) 10CI reject: [V: 04-1] Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (owner: 10Andrew Bogott) [02:41:46] (03PS2) 10Andrew Bogott: docker-service-shim.erb: support a list of arbitrary bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/941031 [02:41:48] (03PS15) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 
10https://gerrit.wikimedia.org/r/940992 [02:42:12] (03CR) 10CI reject: [V: 04-1] Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (owner: 10Andrew Bogott) [02:45:40] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:52:29] (03PS3) 10Andrew Bogott: docker-service-shim.erb: support a list of arbitrary bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/941031 [02:52:31] (03PS16) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 [02:52:56] (03CR) 10CI reject: [V: 04-1] Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (owner: 10Andrew Bogott) [02:57:02] (03PS4) 10Andrew Bogott: docker service: support a list of arbitrary bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/941031 [02:57:04] (03PS17) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (https://phabricator.wikimedia.org/T341640) [02:57:28] (03CR) 10CI reject: [V: 04-1] Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (https://phabricator.wikimedia.org/T341640) (owner: 10Andrew Bogott) [02:59:31] (03PS1) 10Andrew Bogott: just to test the compiler... 
[puppet] - https://gerrit.wikimedia.org/r/941034
[03:00:06] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T0300)
[03:04:33] (PS5) Andrew Bogott: docker service: support a list of arbitrary bind mounts [puppet] - https://gerrit.wikimedia.org/r/941031
[03:04:35] (PS18) Andrew Bogott: Horizon: add docker_deploy profile [puppet] - https://gerrit.wikimedia.org/r/940992 (https://phabricator.wikimedia.org/T341640)
[03:04:37] (PS2) Andrew Bogott: just to test the compiler... [puppet] - https://gerrit.wikimedia.org/r/941034
[03:04:58] (CR) CI reject: [V: -1] Horizon: add docker_deploy profile [puppet] - https://gerrit.wikimedia.org/r/940992 (https://phabricator.wikimedia.org/T341640) (owner: Andrew Bogott)
[03:08:48] (PS19) Andrew Bogott: Horizon: add docker_deploy profile [puppet] - https://gerrit.wikimedia.org/r/940992 (https://phabricator.wikimedia.org/T341640)
[03:11:24] (Abandoned) Andrew Bogott: just to test the compiler...
[puppet] - https://gerrit.wikimedia.org/r/941034 (owner: Andrew Bogott)
[03:13:35] (PS20) Andrew Bogott: Horizon: add docker_deploy profile [puppet] - https://gerrit.wikimedia.org/r/940992 (https://phabricator.wikimedia.org/T341640)
[04:44:49] (PS1) Ryan Kemper: decom wdqs200[4-6] [puppet] - https://gerrit.wikimedia.org/r/941037 (https://phabricator.wikimedia.org/T342035)
[04:56:39] !log ryankemper@cumin1001 START - Cookbook sre.hosts.decommission for hosts wdqs[2004-2006].codfw.wmnet
[04:57:18] (PS2) Ryan Kemper: decom wdqs200[4-6] [puppet] - https://gerrit.wikimedia.org/r/941037 (https://phabricator.wikimedia.org/T342035)
[05:08:08] (CR) Ryan Kemper: [C: +2] decom wdqs200[4-6] [puppet] - https://gerrit.wikimedia.org/r/941037 (https://phabricator.wikimedia.org/T342035) (owner: Ryan Kemper)
[05:08:32] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts wdqs[2004-2006].codfw.wmnet
[05:09:01] !log ryankemper@cumin1001 START - Cookbook sre.hosts.decommission for hosts wdqs[2004-2006].codfw.wmnet
[05:21:42] (CR) Ryan Kemper: "I'll take a note for Brian/I to get this reviewed and deployed this week."
[software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/938210 (https://phabricator.wikimedia.org/T325315) (owner: Peter Fischer)
[05:27:50] ops-codfw, decommission-hardware: decommission wdqs200[4-6] - https://phabricator.wikimedia.org/T342600 (RKemper)
[05:46:17] !log ryankemper@cumin1001 START - Cookbook sre.dns.netbox
[05:52:29] !log ryankemper@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs[2004-2006].codfw.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin1001"
[06:00:06] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T0600)
[06:00:06] kormat, marostegui, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T0600).
[06:10:38] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs[2004-2006].codfw.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin1001"
[06:10:38] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:10:39] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wdqs[2004-2006].codfw.wmnet
[06:12:46] (PS1) Marostegui: db1213: Migrate to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/941040 (https://phabricator.wikimedia.org/T334650)
[06:13:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1213 (s5, s6)', diff saved to https://phabricator.wikimedia.org/P49680 and previous config saved to /var/cache/conftool/dbconfig/20230725-061319-root.json
[06:14:49] (CR) Marostegui: [C: +2] db1213: Migrate to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/941040
(https://phabricator.wikimedia.org/T334650) (owner: 10Marostegui) [06:17:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49681 and previous config saved to /var/cache/conftool/dbconfig/20230725-061742-root.json [06:17:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49682 and previous config saved to /var/cache/conftool/dbconfig/20230725-061753-root.json [06:21:14] (03PS1) 10Marostegui: pc1015,pc1016: New hosts to be set up [puppet] - 10https://gerrit.wikimedia.org/r/941042 (https://phabricator.wikimedia.org/T342164) [06:21:53] (03CR) 10Marostegui: [C: 03+2] pc1015,pc1016: New hosts to be set up [puppet] - 10https://gerrit.wikimedia.org/r/941042 (https://phabricator.wikimedia.org/T342164) (owner: 10Marostegui) [06:32:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49683 and previous config saved to /var/cache/conftool/dbconfig/20230725-063247-root.json [06:32:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49684 and previous config saved to /var/cache/conftool/dbconfig/20230725-063258-root.json [06:33:51] (03CR) 10JMeybohm: [C: 04-1] "Two open comments from me on PS9" [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [06:47:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49685 and previous config saved to /var/cache/conftool/dbconfig/20230725-064751-root.json [06:48:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 
(re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49686 and previous config saved to /var/cache/conftool/dbconfig/20230725-064802-root.json [07:00:06] Amir1, Urbanecm, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T0700). [07:00:06] No Gerrit patches in the queue for this window AFAICS. [07:02:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49687 and previous config saved to /var/cache/conftool/dbconfig/20230725-070256-root.json [07:03:02] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 131 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:03:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49688 and previous config saved to /var/cache/conftool/dbconfig/20230725-070307-root.json [07:08:56] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] apache: Enable view_urls on wikifunctions.org [puppet] - 10https://gerrit.wikimedia.org/r/940246 (https://phabricator.wikimedia.org/T338190) (owner: 10Jforrester) [07:18:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49689 and previous config saved to /var/cache/conftool/dbconfig/20230725-071801-root.json [07:18:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49690 and previous config saved to /var/cache/conftool/dbconfig/20230725-071812-root.json [07:19:58] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - 
degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:30:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:33:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49691 and previous config saved to /var/cache/conftool/dbconfig/20230725-073305-root.json [07:33:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49692 and previous config saved to /var/cache/conftool/dbconfig/20230725-073317-root.json [07:40:06] (03PS1) 10JMeybohm: wmnet: Add cnames for'wikifunctions ingress [dns] - 10https://gerrit.wikimedia.org/r/941312 (https://phabricator.wikimedia.org/T297314) [07:40:28] (03PS2) 10JMeybohm: wmnet: Add cnames for wikifunctions ingress [dns] - 10https://gerrit.wikimedia.org/r/941312 (https://phabricator.wikimedia.org/T297314) [07:48:02] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [07:48:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49693 and previous config saved to /var/cache/conftool/dbconfig/20230725-074810-root.json [07:48:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49694 and previous config saved to /var/cache/conftool/dbconfig/20230725-074821-root.json [07:51:12] (03PS1) 10JMeybohm: service::catalog: Add wikifunctions service [puppet] - 10https://gerrit.wikimedia.org/r/941313 
(https://phabricator.wikimedia.org/T297314) [07:51:14] (03PS1) 10JMeybohm: service::catalog: Switch wikifunctions to state production [puppet] - 10https://gerrit.wikimedia.org/r/941314 (https://phabricator.wikimedia.org/T297314) [07:51:52] (03PS1) 10Elukey: role::kafka::main: increase worker threads for kafka-main1001 [puppet] - 10https://gerrit.wikimedia.org/r/941315 (https://phabricator.wikimedia.org/T341558) [07:53:52] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42684/console" [puppet] - 10https://gerrit.wikimedia.org/r/941315 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey) [07:55:56] (03PS1) 10TrainBranchBot: testwikis wikis to 1.41.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941316 (https://phabricator.wikimedia.org/T340247) [07:55:58] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.41.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941316 (https://phabricator.wikimedia.org/T340247) (owner: 10TrainBranchBot) [07:56:38] (03Merged) 10jenkins-bot: testwikis wikis to 1.41.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941316 (https://phabricator.wikimedia.org/T340247) (owner: 10TrainBranchBot) [07:57:04] !log jnuche@deploy1002 Started scap: testwikis wikis to 1.41.0-wmf.19 refs T340247 [07:57:08] T340247: 1.41.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T340247 [08:00:05] jnuche and dancy: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T0800). 
[08:01:04] morning, the train pre-sync failed last night [08:01:26] I think I've fixed the issue (rebased sec patch needed to be applied) and I'm rerunning the pre-sync now [08:03:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49695 and previous config saved to /var/cache/conftool/dbconfig/20230725-080315-root.json [08:03:20] PROBLEM - SSH on wdqs1013 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:03:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49696 and previous config saved to /var/cache/conftool/dbconfig/20230725-080326-root.json [08:03:58] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:04:48] RECOVERY - SSH on wdqs1013 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:09:18] PROBLEM - SSH on wdqs1013 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:14:22] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [08:15:12] RECOVERY - SSH on wdqs1013 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:22:36] PROBLEM - SSH on wdqs1013 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:24:18] (03CR) 10Vgutierrez: "looks good but I think we should consider performance as well:" [puppet] - 10https://gerrit.wikimedia.org/r/940989 (https://phabricator.wikimedia.org/T342566) (owner: 
10Ssingh) [08:24:56] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [08:25:42] RECOVERY - SSH on wdqs1013 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:26:28] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: add ingress support [deployment-charts] - 10https://gerrit.wikimedia.org/r/940189 (https://phabricator.wikimedia.org/T342356) (owner: 10Giuseppe Lavagetto) [08:27:20] (03Merged) 10jenkins-bot: mediawiki: add ingress support [deployment-charts] - 10https://gerrit.wikimedia.org/r/940189 (https://phabricator.wikimedia.org/T342356) (owner: 10Giuseppe Lavagetto) [08:30:18] PROBLEM - SSH on wdqs1013 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:30:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:31:47] (03PS2) 10Elukey: role::kafka::main: increase worker threads for kafka-main1001 [puppet] - 10https://gerrit.wikimedia.org/r/941315 (https://phabricator.wikimedia.org/T341558) [08:33:02] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42685/console" [puppet] - 10https://gerrit.wikimedia.org/r/941315 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey) [08:33:42] (03CR) 10JMeybohm: [C: 03+1] role::kafka::main: increase worker threads for kafka-main1001 [puppet] - 10https://gerrit.wikimedia.org/r/941315 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey) [08:34:29] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::kafka::main: increase worker threads for kafka-main1001 [puppet] - 
10https://gerrit.wikimedia.org/r/941315 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey) [08:35:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:35:36] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on kafka-main1001.eqiad.wmnet with reason: Apply a new setting to the Kafka broker [08:35:50] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on kafka-main1001.eqiad.wmnet with reason: Apply a new setting to the Kafka broker [08:40:17] (03PS1) 10Jelto: idp: remove nda from required_groups for gitlab_replica_oidc [puppet] - 10https://gerrit.wikimedia.org/r/941319 (https://phabricator.wikimedia.org/T320390) [08:41:40] (03CR) 10Slyngshede: [C: 03+1] "I doubt that makes much of a difference, but I see no reason to not give it a try." [puppet] - 10https://gerrit.wikimedia.org/r/941319 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [08:43:33] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10SLyngshede-WMF) I did a diff of the configurations for idp and idp-test, and they are basically the same, none of the settings tha... 
[08:49:38] !log jnuche@deploy1002 Finished scap: testwikis wikis to 1.41.0-wmf.19 refs T340247 (duration: 52m 35s) [08:49:43] T340247: 1.41.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T340247 [08:49:49] (03PS1) 10Elukey: profile::kafka::broker: fix settings passed to the confluent class [puppet] - 10https://gerrit.wikimedia.org/r/941362 (https://phabricator.wikimedia.org/T341558) [08:51:36] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42686/console" [puppet] - 10https://gerrit.wikimedia.org/r/941362 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey) [08:51:52] !log jnuche@deploy1002 Pruned MediaWiki: 1.41.0-wmf.17 (duration: 02m 11s) [08:52:21] (03PS1) 10Elukey: Revert "role::kafka::main: increase worker threads for kafka-main1001" [puppet] - 10https://gerrit.wikimedia.org/r/940920 [08:52:46] (03CR) 10CI reject: [V: 04-1] Revert "role::kafka::main: increase worker threads for kafka-main1001" [puppet] - 10https://gerrit.wikimedia.org/r/940920 (owner: 10Elukey) [08:53:43] pre-sync done, deploying train to group0 now [08:53:53] (03PS2) 10Elukey: Revert "role::kafka::main: increase worker threads for kafka-main1001" [puppet] - 10https://gerrit.wikimedia.org/r/940920 [08:54:05] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941365 (https://phabricator.wikimedia.org/T340247) [08:54:07] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941365 (https://phabricator.wikimedia.org/T340247) (owner: 10TrainBranchBot) [08:54:48] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941365 (https://phabricator.wikimedia.org/T340247) (owner: 10TrainBranchBot) [08:56:52] (03Abandoned) 10Elukey: Revert "role::kafka::main: increase worker threads for kafka-main1001" [puppet] - 
10https://gerrit.wikimedia.org/r/940920 (owner: 10Elukey) [08:57:26] (03PS13) 10Alexandros Kosiaris: Kubernetes: add support for deployment apparmor profiles [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [08:58:52] (03PS2) 10Elukey: profile::kafka::broker: fix settings passed to the confluent class [puppet] - 10https://gerrit.wikimedia.org/r/941362 (https://phabricator.wikimedia.org/T341558) [08:59:03] <_joe_> jouncebot: next [08:59:03] In 1 hour(s) and 0 minute(s): MediaWiki-related infrastuctural changes (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T1000) [08:59:25] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [08:59:28] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [08:59:43] (03PS1) 10Fabfur: Version 6.0.11-1wm2 for Debian Bookworm [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/941367 (https://phabricator.wikimedia.org/T321309) [09:00:11] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42687/console" [puppet] - 10https://gerrit.wikimedia.org/r/941362 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey) [09:00:22] (03PS1) 10Jcrespo: bacula: Increase the number of max volumes for production pool [puppet] - 10https://gerrit.wikimedia.org/r/941368 [09:01:32] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.19 refs T340247 [09:01:36] T340247: 1.41.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T340247 [09:01:42] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42688/console" [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [09:02:53] (03PS2) 10Fabfur: 
Version 6.0.11-1wm2 for Debian Bookworm [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/941367 (https://phabricator.wikimedia.org/T342154) [09:03:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:06:31] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:06:48] !log Restart Tomcat / Apereo CAS on idp1002 [09:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:10:05] (03CR) 10CI reject: [V: 04-1] Version 6.0.11-1wm2 for Debian Bookworm [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/941367 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [09:10:27] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/941362 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey) [09:12:08] (03CR) 10JMeybohm: [C: 03+1] "We could also discuss not rolling the "changes" out at all to jumbo and logging if we don't have issues there..." 
[puppet] - 10https://gerrit.wikimedia.org/r/941362 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey) [09:14:05] RECOVERY - SSH on wdqs1013 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:24:29] PROBLEM - SSH on wdqs1013 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:30:03] RECOVERY - SSH on wdqs1013 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:31:53] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10MatthewVernon) Yeah, this is my concern, too - we used to spawn extra requests to copy new thumbnails to the other DC and that ca... [09:33:07] PROBLEM - SSH on wdqs1013 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:33:37] (03CR) 10Alexandros Kosiaris: admin: Add wikifunctions apparmor profiles to PSP (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/940371 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris) [09:34:09] RECOVERY - SSH on wdqs1013 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:37:21] PROBLEM - SSH on wdqs1013 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:39:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:41:51] RECOVERY - SSH on wdqs1013 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:44:03] (ProbeDown) 
resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:45:41] PROBLEM - SSH on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:46:23] PROBLEM - Query Service HTTP Port on wdqs1013 is CRITICAL: connect to address 127.0.0.1 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service [09:46:39] PROBLEM - Check systemd state on wdqs1013 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:46:39] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) We restarted `idp1002` and `idp-test1002`. It seems the running configuration was not the one configured, because tomcat is... 
[09:46:47] RECOVERY - SSH on wdqs1013 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:46:47] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::kafka::broker: fix settings passed to the confluent class [puppet] - 10https://gerrit.wikimedia.org/r/941362 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey) [09:46:58] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on kafka-main1001.eqiad.wmnet with reason: Apply a new setting to the Kafka broker [09:47:11] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on kafka-main1001.eqiad.wmnet with reason: Apply a new setting to the Kafka broker [09:47:55] RECOVERY - Check systemd state on wdqs1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:48:23] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [09:48:42] (SystemdUnitFailed) firing: nginx.service Failed on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:48:57] RECOVERY - Query Service HTTP Port on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [09:49:07] (03CR) 10Hnowlan: [C: 03+1] api-gateway: change liftwing hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/940945 (https://phabricator.wikimedia.org/T342266) (owner: 10Ilias Sarantopoulos) [09:50:55] !log restart kafka on kafka-main1001 to pick up the new changes - T341558 [09:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:59] T341558: Rebalance kafka partitions in main-{eqiad,codfw} 
clusters - 2023 edition - https://phabricator.wikimedia.org/T341558 [09:52:46] (03PS3) 10Filippo Giunchedi: mediawiki: remove PHP7 icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/841887 (https://phabricator.wikimedia.org/T314118) [09:53:42] (SystemdUnitFailed) resolved: nginx.service Failed on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:54:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:55:07] ^ expected [09:56:59] (03CR) 10Jcrespo: [C: 03+2] bacula: Increase the number of max volumes for production pool [puppet] - 10https://gerrit.wikimedia.org/r/941368 (owner: 10Jcrespo) [09:59:13] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Ladsgroup) It might sound a bit stupid: Why not just gradually, slowly, roll delete all thumbnails, if it's needed, it'll be rege... [09:59:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [10:00:04] akosiaris: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki-related infrastructural changes deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T1000).
[10:00:05] (03PS14) 10Alexandros Kosiaris: Kubernetes: add support for deployment apparmor profiles [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [10:00:12] 10SRE, 10serviceops-radar, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10fgiunchedi) [10:00:44] (03CR) 10Ilias Sarantopoulos: ml-services: revscoring template change .wiki to reflect wikiID (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) (owner: 10Ilias Sarantopoulos) [10:01:19] (03CR) 10Alexandros Kosiaris: Kubernetes: add support for deployment apparmor profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [10:01:36] (03CR) 10Alexandros Kosiaris: Kubernetes: add support for deployment apparmor profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [10:01:53] (03CR) 10Clément Goubert: [C: 03+1] mediawiki: remove PHP7 icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/841887 (https://phabricator.wikimedia.org/T314118) (owner: 10Filippo Giunchedi) [10:03:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:04:27] (03CR) 10JMeybohm: [C: 03+1] "Cool! 
LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [10:06:59] I haven't yet started the wikidiff2 deploy, doing some unexpected prepwork [10:11:16] (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=parsoid&var-site=eqiad&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:11:34] woot [10:11:36] Ugh [10:12:13] latency went up and now it is trending down [10:12:35] this is the thing we got p.aged about on Saturday too I think [10:12:45] yeah [10:13:00] think I'll reopen T342085 [10:13:01] T342085: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085 [10:15:01] yeah, looking at https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?from=1689984000000&orgId=1&to=1690329599000&var-cluster=parsoid&var-datasource=eqiad+prometheus%2Fops&var-method=GET the latency is up again [10:16:07] Emperor: There's a big increase in timeouts, and they seem to be mostly coming from two userpages [10:16:16] (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=parsoid&var-site=eqiad&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:16:20] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users (no kerberos, no ssh) for karapayneWMDE - https://phabricator.wikimedia.org/T342546 (10BTullis) Hello. I'm listed as one of the approvers for this group, but there are a couple of things that I would like to check first, before proc... 
[10:17:45] (03PS1) 10Amire80: Remove ak from wgImportSources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941372 (https://phabricator.wikimedia.org/T333765) [10:18:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:19:36] (03PS1) 10Giuseppe Lavagetto: requestctl: also escape the url_path parameter [software/conftool] - 10https://gerrit.wikimedia.org/r/941373 [10:21:47] (03CR) 10JMeybohm: [C: 03+1] admin: Add wikifunctions apparmor profiles to PSP (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/940371 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris) [10:23:11] (03CR) 10JMeybohm: [C: 04-1] admin: Add wikifunctions apparmor profiles to PSP (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/940371 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris) [10:26:28] (03CR) 10Btullis: [C: 03+1] "Bit late to the party, but this is fine by me." 
[puppet] - 10https://gerrit.wikimedia.org/r/941362 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey) [10:27:21] (03PS1) 10Hnowlan: WIP helmfile: add namespace and service definition for geo-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/941374 (https://phabricator.wikimedia.org/T336400) [10:27:23] (03CR) 10Btullis: [V: 03+1 C: 03+2] Exclude nagios checks of tmpfs mounts on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/941014 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [10:27:36] (03CR) 10Btullis: [C: 03+2] Install the ceph-volume and hdparm packages on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/941010 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [10:28:16] (03PS2) 10Btullis: Exclude nagios checks of tmpfs mounts on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/941014 (https://phabricator.wikimedia.org/T330151) [10:28:44] (03PS2) 10Hnowlan: WIP helmfile: add namespace and service definition for geo-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/941374 (https://phabricator.wikimedia.org/T336400) [10:29:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:37:01] (03PS1) 10Ladsgroup: Add make_el_to_nullable_T342617.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/941375 (https://phabricator.wikimedia.org/T342617) [10:40:55] (03CR) 10JMeybohm: [C: 03+1] "whitespace nits, but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/940323 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [10:41:15] (03CR) 10JMeybohm: [C: 03+1] "Comment on the 
modules/base/values.yaml question, but I think it's fine either way" [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [10:44:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:45:57] (03PS1) 10Giuseppe Lavagetto: mediawiki: differentiate parsoid alerts [alerts] - 10https://gerrit.wikimedia.org/r/941378 [10:47:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:48:55] (03CR) 10JMeybohm: [C: 04-1] modules: Add a new networkpolicy for base modules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [10:49:29] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:50:51] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.298 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:50:53] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 31 days, 0:00:00 on lvs[1013-1015].eqiad.wmnet with reason: test hosts [10:51:08] !log vgutierrez@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 31 days, 0:00:00 on lvs[1013-1015].eqiad.wmnet with reason: test hosts [10:52:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:57:36] (03CR) 10Hnowlan: [C: 03+2] cache: set api.wikimedia.org to normal caching [puppet] - 10https://gerrit.wikimedia.org/r/937061 (https://phabricator.wikimedia.org/T338916) (owner: 10Hnowlan) [10:59:29] (03PS7) 10Ilias Sarantopoulos: ml-services: revscoring template change .wiki to reflect wikiID [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) [10:59:57] (03CR) 10Jforrester: [C: 03+1] wmnet: Add cnames for wikifunctions ingress [dns] - 10https://gerrit.wikimedia.org/r/941312 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm) [11:00:58] (03PS2) 10Hnowlan: blubber: Bump blubber version to v0.17.0 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/906575 (https://phabricator.wikimedia.org/T334205) (owner: 10Atieno) [11:02:57] (03CR) 10Marostegui: [C: 03+1] Add make_el_to_nullable_T342617.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/941375 (https://phabricator.wikimedia.org/T342617) (owner: 10Ladsgroup) [11:03:34] (03PS2) 10Giuseppe Lavagetto: mediawiki: differentiate parsoid alerts [alerts] - 10https://gerrit.wikimedia.org/r/941378 [11:03:58] (03CR) 10Clément Goubert: [C: 03+1] mediawiki: differentiate parsoid alerts [alerts] - 10https://gerrit.wikimedia.org/r/941378 (owner: 10Giuseppe Lavagetto) [11:05:21] (03PS4) 10JMeybohm: CI: Generate deployment fixtures from actual hiera data [deployment-charts] - 
10https://gerrit.wikimedia.org/r/939315 (https://phabricator.wikimedia.org/T300033) [11:05:58] (03PS3) 10Jforrester: admin: Add wikifunctions apparmor profiles to PSP [deployment-charts] - 10https://gerrit.wikimedia.org/r/940371 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris) [11:06:05] (03PS4) 10Jforrester: admin: Add wikifunctions apparmor profiles to PSP [deployment-charts] - 10https://gerrit.wikimedia.org/r/940371 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris) [11:06:08] (03CR) 10MVernon: "So this is splitting out the parsoid alerts to now page after 5m of <50% idle workers, and also returning the previous alerts (for <30% id" [alerts] - 10https://gerrit.wikimedia.org/r/941378 (owner: 10Giuseppe Lavagetto) [11:06:10] (03CR) 10Jforrester: admin: Add wikifunctions apparmor profiles to PSP (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/940371 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris) [11:06:34] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero) [11:07:35] (03PS5) 10Jforrester: wikifunctions: Add AppArmor profile usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/879282 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris) [11:08:27] (03PS8) 10Ilias Sarantopoulos: ml-services: revscoring template change .wiki to reflect wikiID [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) [11:09:13] (03PS1) 10Btullis: Stop repeatedly disabling write cache on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/941380 (https://phabricator.wikimedia.org/T330151) [11:10:42] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42689/console" [puppet] - 
10https://gerrit.wikimedia.org/r/941380 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [11:11:20] (03CR) 10JMeybohm: CI: Generate deployment fixtures from actual hiera data (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/939315 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [11:12:06] (03CR) 10Klausman: ml-services: revscoring template change .wiki to reflect wikiID (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) (owner: 10Ilias Sarantopoulos) [11:12:16] (03CR) 10Clément Goubert: [C: 03+1] mediawiki: differentiate parsoid alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/941378 (owner: 10Giuseppe Lavagetto) [11:12:45] (03CR) 10Clément Goubert: [C: 03+1] mediawiki: differentiate parsoid alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/941378 (owner: 10Giuseppe Lavagetto) [11:14:31] (03CR) 10Btullis: [V: 03+1 C: 03+2] Stop repeatedly disabling write cache on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/941380 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [11:15:13] (03CR) 10JMeybohm: [C: 03+1] admin: Add wikifunctions apparmor profiles to PSP [deployment-charts] - 10https://gerrit.wikimedia.org/r/940371 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris) [11:21:54] (03PS1) 10Arturo Borrero Gonzalez: cloudservices1006: prepare service [puppet] - 10https://gerrit.wikimedia.org/r/941383 (https://phabricator.wikimedia.org/T342161) [11:22:48] akosiaris: hi, just checking in, did wikidiff2 get deployed or are you still on the unexpected work? 
:-) [11:23:06] TheresNoTime: starting right now [11:23:14] ack :) [11:24:17] !log T340087 starting wikidiff2 1.14.1 rollout to codfw [11:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:21] T340087: Deploy wikidiff2 1.14.1 - https://phabricator.wikimedia.org/T340087 [11:24:29] Ooh, fancy. [11:24:54] … how do we get downstreams like Debian to pick up new releases of wikidiff2? [11:25:08] !log T340087 keep a copy php-wikidiff2_1.13.0-1_amd64.deb in apt1001:/home/akosiaris/wd/ in case of emergency [11:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:25] James_F: I assume we ping legok.tm :-) [11:25:54] OK, fair, but what about randoms like 1&1 MW hosting etc. :-) [11:26:03] Do we just hope they notice? [11:26:03] Maintainer: MediaWiki packaging team [11:26:13] PROBLEM - PHP opcache health on mw1457 is CRITICAL: CRITICAL: opcache full on php 7.4. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:26:31] hmm this one ^ has nothing to do with my change. /me looking [11:26:35] I don't think I've consciously seen us ever send a note to mediawiki-announce or whatever. [11:27:14] James_F: I don't think so either. [11:27:15] (03PS2) 10Arturo Borrero Gonzalez: cloudservices1006: prepare service [puppet] - 10https://gerrit.wikimedia.org/r/941383 (https://phabricator.wikimedia.org/T342161) [11:27:34] Something for the new MW group to think about. [11:27:55] James_F: https://tracker.debian.org/news/1443137/accepted-wikidiff2-1141-1-source-into-unstable/ [11:28:03] !log restart php on mw1457 [11:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:14] taavi: Of course you're already on this.
:-) [11:28:16] * akosiaris waiting for this to clear out and then proceeding with wikidiff2 in eqiad [11:29:15] RECOVERY - PHP opcache health on mw1457 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:29:58] !log T340087 starting wikidiff2 1.14.1 rollout to eqiad. codfw already done. [11:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:02] T340087: Deploy wikidiff2 1.14.1 - https://phabricator.wikimedia.org/T340087 [11:32:17] !log T340087 wikidiff2 rollout done. 1 host is unreachable and will need to be reimaged or upgraded manually to pick this up, parse1002.eqiad.wmnet [11:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:24] (03CR) 10Hnowlan: [C: 03+2] blubber: Bump blubber version to v0.17.0 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/906575 (https://phabricator.wikimedia.org/T334205) (owner: 10Atieno) [11:32:26] (03CR) 10MVernon: [C: 03+1] mediawiki: differentiate parsoid alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/941378 (owner: 10Giuseppe Lavagetto) [11:32:28] TheresNoTime: And we are done. [11:32:37] woo, thank you :) [11:32:52] yw [11:33:58] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [11:34:55] ACKNOWLEDGEMENT - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 131 jobs Jcrespo backups now catching up after storage issue solved - The acknowledgement expires at: 2023-07-26 11:34:11. https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:35:14] ^ Emperor, marostegui [11:35:25] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) [11:35:33] thanks [11:35:37] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!"
[puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm) [11:36:29] (03Merged) 10jenkins-bot: blubber: Bump blubber version to v0.17.0 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/906575 (https://phabricator.wikimedia.org/T334205) (owner: 10Atieno) [11:36:40] (03Abandoned) 10Jelto: idp: remove nda from required_groups for gitlab_replica_oidc [puppet] - 10https://gerrit.wikimedia.org/r/941319 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto) [11:36:53] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: openstack - aborrero@cumin1001" [11:37:39] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: openstack - aborrero@cumin1001" [11:37:39] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:39:12] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Add wikifunctions apparmor profiles to PSP [deployment-charts] - 10https://gerrit.wikimedia.org/r/940371 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris) [11:40:06] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Add wikifunctions apparmor profiles to PSP (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/940371 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris) [11:40:47] (03PS3) 10Arturo Borrero Gonzalez: cloudservices1006: prepare service [puppet] - 10https://gerrit.wikimedia.org/r/941383 (https://phabricator.wikimedia.org/T342161) [11:41:42] (03Merged) 10jenkins-bot: admin: Add wikifunctions apparmor profiles to PSP [deployment-charts] - 10https://gerrit.wikimedia.org/r/940371 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris) [11:45:41] !log akosiaris@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[11:45:56] 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10SLyngshede-WMF) We previously suspected that the issue was that CAS nested the attributes it returns via the profile, but gave up... [11:46:12] (03CR) 10Jon Harald Søby: [C: 04-1] [DNM] Initial configuration for Wikifunctions.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [11:46:34] 10SRE, 10Infrastructure-Foundations, 10Traffic: NetworkProbeLimit cookie should set samesite attribute - https://phabricator.wikimedia.org/T342624 (10Reedy) [11:46:57] !log akosiaris@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:47:26] (03CR) 10Ladsgroup: [C: 03+2] Add make_el_to_nullable_T342617.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/941375 (https://phabricator.wikimedia.org/T342617) (owner: 10Ladsgroup) [11:47:35] !log akosiaris@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:47:51] (03Merged) 10jenkins-bot: Add make_el_to_nullable_T342617.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/941375 (https://phabricator.wikimedia.org/T342617) (owner: 10Ladsgroup) [11:48:05] !log akosiaris@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [11:48:15] (03PS15) 10Slyngshede: C:bigtop::hadoop move net-topology.py to files. [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) [11:48:48] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [11:48:59] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:49:41] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:49:50] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[11:54:07] (03CR) 10Alexandros Kosiaris: [C: 03+2] wikifunctions: Add AppArmor profile usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/879282 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris) [11:54:59] (03Merged) 10jenkins-bot: wikifunctions: Add AppArmor profile usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/879282 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris) [12:02:25] (03PS1) 10Slyngshede: D:apereo_cas::service support FLAT profiles. [puppet] - 10https://gerrit.wikimedia.org/r/941391 (https://phabricator.wikimedia.org/T320390) [12:05:30] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42690/console" [puppet] - 10https://gerrit.wikimedia.org/r/941391 (https://phabricator.wikimedia.org/T320390) (owner: 10Slyngshede) [12:06:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [12:06:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [12:06:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2114 (T342617)', diff saved to https://phabricator.wikimedia.org/P49699 and previous config saved to /var/cache/conftool/dbconfig/20230725-120641-ladsgroup.json [12:06:45] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [12:08:30] (03PS2) 10Slyngshede: D:apereo_cas::service support FLAT profiles. [puppet] - 10https://gerrit.wikimedia.org/r/941391 (https://phabricator.wikimedia.org/T320390) [12:08:43] (03CR) 10Jelto: [C: 03+1] "lgtm, this should only affect idp-test and gitlab-replica. 
We should keep in mind that we set NESTED for all other OIDC clients as well wi" [puppet] - 10https://gerrit.wikimedia.org/r/941391 (https://phabricator.wikimedia.org/T320390) (owner: 10Slyngshede) [12:11:58] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42691/console" [puppet] - 10https://gerrit.wikimedia.org/r/941391 (https://phabricator.wikimedia.org/T320390) (owner: 10Slyngshede) [12:12:12] (03PS7) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) [12:12:14] (03PS9) 10Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) [12:12:16] (03PS9) 10Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) [12:12:25] (03CR) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [12:16:12] (03CR) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [12:18:14] (03PS8) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) [12:18:16] (03PS10) 10Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) [12:18:18] (03PS10) 10Alexandros Kosiaris: cxserver: 
Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) [12:18:26] (03CR) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [12:24:11] (03PS3) 10Alexandros Kosiaris: deployment: Support making k8s deploys db section aware [puppet] - 10https://gerrit.wikimedia.org/r/940323 (https://phabricator.wikimedia.org/T340843) [12:24:28] (03CR) 10Alexandros Kosiaris: deployment: Support making k8s deploys db section aware (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/940323 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [12:27:24] (03PS1) 10Elukey: role::kafka::main: apply new threads settings to all brokers [puppet] - 10https://gerrit.wikimedia.org/r/941396 (https://phabricator.wikimedia.org/T341558) [12:30:48] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42692/console" [puppet] - 10https://gerrit.wikimedia.org/r/941396 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey) [12:34:39] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:36:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T342617)', diff saved to https://phabricator.wikimedia.org/P49700 and previous config saved to /var/cache/conftool/dbconfig/20230725-123602-ladsgroup.json [12:36:08] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [12:41:33] (03CR) 10Filippo Giunchedi: [C: 03+2] mediawiki: remove PHP7 icinga checks [puppet] - 
10https://gerrit.wikimedia.org/r/841887 (https://phabricator.wikimedia.org/T314118) (owner: 10Filippo Giunchedi) [12:43:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10Clement_Goubert) Note for #serviceops later: once fixed, the host will need to be updated to pick up {T340087} [12:45:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:45:29] (03CR) 10Clément Goubert: [C: 03+1] role::kafka::main: apply new threads settings to all brokers [puppet] - 10https://gerrit.wikimedia.org/r/941396 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey) [12:49:12] (03CR) 10Alexandros Kosiaris: [C: 03+2] deployment: Support making k8s deploys db section aware [puppet] - 10https://gerrit.wikimedia.org/r/940323 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [12:49:15] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T342592 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact [12:50:58] (03CR) 10Alexandros Kosiaris: [C: 03+1] role::kafka::main: apply new threads settings to all brokers [puppet] - 10https://gerrit.wikimedia.org/r/941396 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey) [12:51:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P49701 and previous config saved to /var/cache/conftool/dbconfig/20230725-125109-ladsgroup.json [12:55:25] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::kafka::main: apply new threads settings to all brokers [puppet] - 10https://gerrit.wikimedia.org/r/941396 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey) [12:58:54] (03PS1) 10Jelto: aptrepo: update gitlab-ce & gitlab-runner to 16.0 [puppet] - 
10https://gerrit.wikimedia.org/r/941398 (https://phabricator.wikimedia.org/T338460) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: Dear deployers, time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T1300). [13:00:05] Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T1300) [13:00:13] \o [13:00:24] I can deploy in a few moments, just wrapping up another thing [13:00:38] 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T342565 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated power cord, alert did not clear. reseated PSU1, alert cleared. [13:00:49] (if someone else is around, feel free to go ahead) [13:00:59] (03PS9) 10Ilias Sarantopoulos: ml-services: revscoring template change .wiki to reflect wikiID [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) [13:01:01] i can deploy today [13:01:08] hi Dreamy_Jazz [13:01:12] Hello. [13:01:20] thx urbanecm [13:01:31] I'm also coordinating with Ladsgroup to check that the tables are not being replicated to cloud DBs [13:01:41] great [13:02:00] Dreamy_Jazz: do we have all relevant code in wmf.19 (maybe even .18) already? [13:02:09] As far as I am aware, yes [13:02:14] But let me double check [13:03:09] wmf.19 has all the relevant changes [13:03:29] Including the moving of the default value to write new (as well as write and read old) [13:03:32] but not .18, afaik. 
[13:03:37] Yes [13:03:42] Not wmf.18 [13:03:44] which means a train rollback can cause the code to be missing [13:03:50] what would happen in that scenario? [13:04:33] * Lucas_WMDE also around now but probably not needed :) [13:04:36] I'm not sure [13:04:52] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: differentiate parsoid alerts [alerts] - 10https://gerrit.wikimedia.org/r/941378 (owner: 10Giuseppe Lavagetto) [13:04:55] Though what I think would happen is that the code would only write old [13:05:08] But let me check that [13:05:18] okay [13:05:29] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 132 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [13:05:57] (03Merged) 10jenkins-bot: mediawiki: differentiate parsoid alerts [alerts] - 10https://gerrit.wikimedia.org/r/941378 (owner: 10Giuseppe Lavagetto) [13:06:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P49702 and previous config saved to /var/cache/conftool/dbconfig/20230725-130615-ladsgroup.json [13:06:40] 10SRE, 10ops-codfw: Decommission asw-b1-codfw - https://phabricator.wikimedia.org/T342076 (10Papaul) 05Open→03Resolved a:03Papaul This is complete [13:06:54] (03CR) 10Ilias Sarantopoulos: ml-services: revscoring template change .wiki to reflect wikiID (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) (owner: 10Ilias Sarantopoulos) [13:07:07] Unfortunately, it looks like at least some of the code would end up only writing new [13:07:33] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [13:07:37] So if you think the possibility of a train rollback on testwiki is too high, then perhaps waiting until next week, do you think? [13:09:52] urbanecm: ^? 
[13:09:54] if train rollback means write new behavior in some branches, i think that's a significant issue, as it'll give us inconsistent data. i'm afraid fixing it could get difficult, esp. if the `cuc_only_for_read_old` gets inconsistent as well [13:10:10] 10SRE, 10ops-codfw: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10Papaul) 05Open→03Resolved The test server has been returned so we are good to close this task. [13:10:10] i think waiting for next week is a good idea. alternatively, we can backport stuff. [13:10:21] there will be icinga config failure alerts for alert hosts, that is me [13:10:31] but afaik it's quite a lot of patches to backport through? [13:10:58] (03CR) 10Jon Harald Søby: [DNM] Initial configuration for Wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [13:11:19] I think it would only be one patch that would need backporting [13:11:33] Actually would be two because of the follow-up [13:12:04] It would be https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/CheckUser/+/59888cddf495de7ea2b4ec6ff563f9543713281b and https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/CheckUser/+/213281cf65997cd52701845208307b39e06e9163 [13:12:30] gotcha. then, up2you. happy to backport those two if you want to do this earlier rather than later. [13:13:05] It would be good to test both replication to cloud DBs and have a good amount of testing time on testwiki [13:13:17] let's backport then :) [13:13:21] Thanks :) [13:13:34] (03PS1) 10Urbanecm: Add support for writing both new and old to Hooks.php [extensions/CheckUser] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/941414 (https://phabricator.wikimedia.org/T341934) [13:14:16] The other patches that were needed to change the default modify code that wouldn't be run unless update.php is run and/or those maintenance scripts are run. 
This won't happen unless someone was to run them manually. [13:14:20] (03PS1) 10Urbanecm: Follow-up: Add support for writing both new and old to Hooks.php [extensions/CheckUser] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/941400 (https://phabricator.wikimedia.org/T341586) [13:14:23] (03CR) 10Urbanecm: [C: 03+2] Add support for writing both new and old to Hooks.php [extensions/CheckUser] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/941414 (https://phabricator.wikimedia.org/T341934) (owner: 10Urbanecm) [13:14:30] (03CR) 10Urbanecm: [C: 03+2] Follow-up: Add support for writing both new and old to Hooks.php [extensions/CheckUser] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/941400 (https://phabricator.wikimedia.org/T341586) (owner: 10Urbanecm) [13:15:47] gotcha [13:16:28] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/CheckUser] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/941414 (https://phabricator.wikimedia.org/T341934) (owner: 10Urbanecm) [13:16:30] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/CheckUser] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/941400 (https://phabricator.wikimedia.org/T341586) (owner: 10Urbanecm) [13:17:21] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-main-codfw cluster: Roll restart of jvm daemons. [13:18:29] syncing w/o testing, as it can't really be tested with write new, and its no-op otherwise [13:19:10] Plus that change is on group0 wikis as it's in wmf.19 (so it shouldn't be an issue). 
[13:19:32] yup [13:19:49] RECOVERY - Host parse1002 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [13:20:03] !log powercycle parse1002 - T339340 [13:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:06] T339340: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 [13:20:09] PROBLEM - Check systemd state on parse1002 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:21:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T342617)', diff saved to https://phabricator.wikimedia.org/P49704 and previous config saved to /var/cache/conftool/dbconfig/20230725-132121-ladsgroup.json [13:21:26] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [13:21:39] RECOVERY - Check systemd state on parse1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:23:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10fgiunchedi) I rebooted the host because I needed a puppet run on it, I'll leave it alone now! 
[13:24:15] RECOVERY - Check systemd state on mw1424 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:24:49] (03CR) 10EoghanGaffney: [C: 03+1] aptrepo: update gitlab-ce & gitlab-runner to 16.0 [puppet] - 10https://gerrit.wikimedia.org/r/941398 (https://phabricator.wikimedia.org/T338460) (owner: 10Jelto) [13:26:07] (03PS1) 10Majavah: Add perl536-sssd [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941401 (https://phabricator.wikimedia.org/T335507) [13:26:27] 10SRE, 10serviceops-radar, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10fgiunchedi) [13:28:06] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [13:29:51] (03CR) 10CDanis: [C: 03+2] requestctl: also escape the url_path parameter [software/conftool] - 10https://gerrit.wikimedia.org/r/941373 (owner: 10Giuseppe Lavagetto) [13:29:56] (03Merged) 10jenkins-bot: Add support for writing both new and old to Hooks.php [extensions/CheckUser] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/941414 (https://phabricator.wikimedia.org/T341934) (owner: 10Urbanecm) [13:30:07] (03Merged) 10jenkins-bot: Follow-up: Add support for writing both new and old to Hooks.php [extensions/CheckUser] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/941400 (https://phabricator.wikimedia.org/T341586) (owner: 10Urbanecm) [13:30:13] there we go [13:30:20] Great. [13:30:22] Still around. 
[13:30:52] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:941414|Add support for writing both new and old to Hooks.php (T341934 T341586)]], [[gerrit:941400|Follow-up: Add support for writing both new and old to Hooks.php (T341586)]] [13:30:59] T341586: Allow write old and new for event table migration - https://phabricator.wikimedia.org/T341586 [13:30:59] T341934: Failing tests for CheckUser when event table migration config set to WRITE_BOTH and READ_NEW - https://phabricator.wikimedia.org/T341934 [13:31:10] it'll go w/o the mwdebug stop though, as i mentioned. [13:32:56] (03Merged) 10jenkins-bot: requestctl: also escape the url_path parameter [software/conftool] - 10https://gerrit.wikimedia.org/r/941373 (owner: 10Giuseppe Lavagetto) [13:38:21] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:941414|Add support for writing both new and old to Hooks.php (T341934 T341586)]], [[gerrit:941400|Follow-up: Add support for writing both new and old to Hooks.php (T341586)]] (duration: 07m 28s) [13:38:26] T341586: Allow write old and new for event table migration - https://phabricator.wikimedia.org/T341586 [13:38:26] T341934: Failing tests for CheckUser when event table migration config set to WRITE_BOTH and READ_NEW - https://phabricator.wikimedia.org/T341934 [13:38:28] (03PS4) 10Urbanecm: Enable write new on testwiki for CheckUser event tables migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940927 (https://phabricator.wikimedia.org/T330158) (owner: 10Dreamy Jazz) [13:38:31] (03CR) 10Urbanecm: [C: 03+2] Enable write new on testwiki for CheckUser event tables migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940927 (https://phabricator.wikimedia.org/T330158) (owner: 10Dreamy Jazz) [13:38:38] so, backport done [13:38:43] let's move on to the config change now [13:38:45] Nice. Thanks! 
[13:38:52] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cgoubert@cumin1001" [13:38:53] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1486.eqiad.wmnet with OS buster [13:39:12] (03Merged) 10jenkins-bot: Enable write new on testwiki for CheckUser event tables migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940927 (https://phabricator.wikimedia.org/T330158) (owner: 10Dreamy Jazz) [13:40:12] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:940927|Enable write new on testwiki for CheckUser event tables migration (T330158)]] [13:40:13] (03PS9) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) [13:40:15] T330158: Enable write new for the event table migration - https://phabricator.wikimedia.org/T330158 [13:40:22] !log cgoubert@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "re-run to fix mw1486 - cgoubert@cumin1001" [13:41:07] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "re-run to fix mw1486 - cgoubert@cumin1001" [13:41:14] (03PS1) 10Giuseppe Lavagetto: Add mw-misc service under ingress [dns] - 10https://gerrit.wikimedia.org/r/941403 (https://phabricator.wikimedia.org/T341859) [13:41:49] !log urbanecm@deploy1002 urbanecm and dreamyjazz: Backport for [[gerrit:940927|Enable write new on testwiki for CheckUser event tables migration (T330158)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:41:58] Dreamy_Jazz: available at mwdebug now :). can you test? 
[13:42:05] Sure [13:42:09] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 15 days, 0:00:00 on parse1002.eqiad.wmnet with reason: T339340 - hw troubleshooting [13:42:12] T339340: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 [13:42:22] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 15 days, 0:00:00 on parse1002.eqiad.wmnet with reason: T339340 - hw troubleshooting [13:43:06] Will events appear in logstash for debug servers? [13:43:12] Just want to look for any exceptions [13:43:48] Dreamy_Jazz: https://logstash.wikimedia.org/app/dashboards#/view/mwdebug1002?_g=h@42b0d52&_a=h@7f0701a [13:43:53] Dreamy_Jazz: yes, they will, on the mwdebug server dashboard too [13:44:01] which claime helpfully linked, thank you. [13:44:07] Thanks both [13:44:09] Still testing [13:44:39] you can also enable Verbose log, which will direct all logs (even debug logs) to logstash [13:45:34] 10SRE-swift-storage, 10Data-Persistence, 10Discovery-Search: Storage request: swift s3 bucket for flink search-update-pipeline checkpointing - https://phabricator.wikimedia.org/T342620 (10bking) [13:46:59] Nearly done. Will need someone to inspect DB shortly as logstash doesn't show the insert queries [13:47:11] sure [13:48:09] (fwiw, you would see insert queries with verbose logging, but happy to inspect manually) [13:48:16] Hmm. [13:48:23] I didn't find them on that page [13:49:29] Okay. My part in the testing is done. [13:49:59] If you could inspect the DB and see if "cu_log_event" and "cu_private_event" have entries [13:50:01] https://logstash.wikimedia.org/goto/be7902d11198d60c31c068e7a3ee25ce shows a bunch of inserts [13:50:02] sure [13:50:27] i see two entries in each table [13:50:27] Thanks for that link. [13:51:04] Yup. [13:51:16] That is the expected state [13:51:18] great [13:51:25] should we check Special:Checkuser works as expected too? 
[13:51:27] (I moved twice and logged out and then back in) [13:51:37] Sure. [13:51:53] yup, matches what i see in the tables [13:52:07] The change in this config should not have affected anything in Special:CheckUser, but happy to test it anyway [13:52:49] just to be on the safe end [13:53:08] For sure. [13:53:24] Ready to test if you would like me to do so (will need CU rights on testwiki) [13:53:36] granted [13:53:39] please go ahead [13:54:55] Special:CheckUser works as normal (no change). Will try Investigate [13:55:38] I have hit an exception in Investigate, but it seems unrelated to this config change [13:56:05] "Wikimedia\Assert\PreconditionException: Expected MediaWiki\User\UserIdentityValue to belong to 'afwiki', but it belongs to the local wiki" [13:56:19] I will check if that happens on non-debug servers [13:56:33] Yup. Same error on non-debug servers [13:56:36] So should be unrelated [13:56:39] okay, so unrelated [13:56:45] so, let's proceed then? [13:56:49] Yes. [13:56:57] deploying [13:57:02] I will file a bug report for the issue that I found in Investigate. [13:57:05] (03PS1) 10Hnowlan: trafficserver: add gateway routing script, route device-analytics [puppet] - 10https://gerrit.wikimedia.org/r/941405 (https://phabricator.wikimedia.org/T320967) [13:57:06] ty [13:57:14] Thanks for your help on this! [13:57:19] any time [14:00:21] !log rolling out pdns-recursor update on A:dns-rec [14:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:39] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:940927|Enable write new on testwiki for CheckUser event tables migration (T330158)]] (duration: 22m 27s) [14:02:42] T330158: Enable write new for the event table migration - https://phabricator.wikimedia.org/T330158 [14:02:43] and live [14:02:47] Yay! [14:02:49] Dreamy_Jazz: anything else i can help with today? [14:02:49] Thanks again [14:02:58] No. That should be everything. [14:03:26] okay, great! 
[14:03:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:06:33] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:07:22] Issue found with Investigate reported as https://phabricator.wikimedia.org/T342655. Should be fine with early removal of CU rights if needed (noticed it expires in around 50 mins) [14:08:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:10:04] (03CR) 10Vgutierrez: [C: 03+1] "LGTM, proceed with the usual caution with this one" [puppet] - 10https://gerrit.wikimedia.org/r/941405 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [14:10:30] ty [14:11:13] 10SRE, 10ops-codfw, 10decommission-hardware: decommission wdqs200[4-6] - https://phabricator.wikimedia.org/T342600 (10Jhancock.wm) 05Open→03Resolved [14:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:12:44] 10SRE, 10ops-codfw, 10decommission-hardware: decommission wdqs200[4-6] - https://phabricator.wikimedia.org/T342600 (10Jhancock.wm) DECOMed servers, SSDs removed, servers in storage, and updated in netbox. 
[14:13:15] (03PS1) 10Giuseppe Lavagetto: service::catalog: add mw-misc [puppet] - 10https://gerrit.wikimedia.org/r/941429 (https://phabricator.wikimedia.org/T341859) [14:16:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:18:37] (03PS3) 10Ssingh: varnish: handle varnish-frontend-hospital crash with a (null) line [puppet] - 10https://gerrit.wikimedia.org/r/940989 (https://phabricator.wikimedia.org/T342566) [14:19:00] (03CR) 10Ssingh: "Thanks for the review! Addressed the comment." [puppet] - 10https://gerrit.wikimedia.org/r/940989 (https://phabricator.wikimedia.org/T342566) (owner: 10Ssingh) [14:19:54] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:29] (03CR) 10Vgutierrez: [C: 03+1] varnish: handle varnish-frontend-hospital crash with a (null) line [puppet] - 10https://gerrit.wikimedia.org/r/940989 (https://phabricator.wikimedia.org/T342566) (owner: 10Ssingh) [14:24:22] !log start stopping services and rebooting lvs5006 (T335835) [14:24:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:59] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs5006.eqsin.wmnet [14:27:12] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host lvs5006.eqsin.wmnet [14:28:15] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs5006.eqsin.wmnet [14:28:18] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host lvs5006.eqsin.wmnet [14:29:02] !log disabling puppet on A:cp for rollout of r/941405 [14:29:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:13] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs5006.eqsin.wmnet [14:29:15] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host lvs5006.eqsin.wmnet [14:29:39] (03PS1) 10Zabe: Add namespace translations for Mandailing (btm) [core] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941416 (https://phabricator.wikimedia.org/T335217) [14:29:57] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs5006.eqsin.wmnet [14:30:00] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host lvs5006.eqsin.wmnet [14:30:03] (03PS1) 10Zabe: Add namespace translations for Mandailing (btm) [core] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/941417 (https://phabricator.wikimedia.org/T335217) [14:30:16] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs5006.eqsin.wmnet [14:30:28] (03CR) 10Hnowlan: [C: 03+2] trafficserver: add gateway routing script, route device-analytics [puppet] - 10https://gerrit.wikimedia.org/r/941405 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan) [14:30:42] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=ats-be,name=cp2037.codfw.wmnet [14:33:16] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs5006.eqsin.wmnet [14:34:16] PROBLEM - pybal on lvs5006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [14:34:24] PROBLEM - PyBal backends health check on lvs5006 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [14:34:45] wtf? [14:34:49] oh... 
expected :_) [14:34:52] :D [14:35:06] see Valentin is always jumpy with pybal :D [14:35:17] "start stopping" [14:35:25] oxymoronic log lines by fabfur [14:35:33] you can wake him up from deep sleep by typing "pybal critical" [14:35:44] RECOVERY - pybal on lvs5006 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [14:35:50] RECOVERY - PyBal backends health check on lvs5006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:35:55] !log lvs5006 rebooted and services restarted (T335835) [14:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:06] sukhe: wait, I'm the only one with a brain IRQ wired to my IRC client? [14:36:17] :] [14:36:21] That's called a taser [14:36:25] And it's illegal in most places [14:36:34] :p [14:36:50] claime: anything good is illegal in at least some place [14:37:04] bash [14:37:05] x) [14:37:21] sukhe: O:) [14:37:51] there's also "stop stopping" and "start starting" but I use them only in few occasions [14:38:02] https://bash.toolforge.org/quip/gXd8jYkBGiVuUzOdOLax [14:38:54] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: revscoring template change .wiki to reflect wikiID [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) (owner: 10Ilias Sarantopoulos) [14:39:00] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:40:20] (03CR) 10Clément Goubert: [C: 03+1] kubernetes: add mw-misc "service" [puppet] - 10https://gerrit.wikimedia.org/r/940186 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [14:40:28] (03Merged) 10jenkins-bot: ml-services: revscoring template change .wiki to reflect wikiID [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) (owner: 10Ilias 
Sarantopoulos) [14:41:07] 10SRE, 10Traffic: Upgrade to pdns-recursor 4.8.4 - https://phabricator.wikimedia.org/T341611 (10ssingh) [14:41:33] (03CR) 10Clément Goubert: [C: 03+1] Add mw-misc service under ingress [dns] - 10https://gerrit.wikimedia.org/r/941403 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [14:41:34] 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10RobH) [14:41:39] 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10RobH) [14:41:42] (03CR) 10Jobo: [C: 03+2] groups: Add taavi to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/940269 (https://phabricator.wikimedia.org/T342307) (owner: 10Andrea Denisse) [14:42:08] (03PS1) 10Elukey: profile::logstash: allow more Istio ingress gateway logs [puppet] - 10https://gerrit.wikimedia.org/r/941434 [14:42:52] fabfur: begin and finish are your friends <3 [14:43:04] 10SRE, 10Traffic: Upgrade to pdns-recursor 4.8.4 - https://phabricator.wikimedia.org/T341611 (10ssingh) 05Open→03Resolved ` ||/ Name Version Architecture Description +++-==============-===============-============-================================= ii pdns-recursor 4.8.4-1+wmf11u1 amd64... 
[14:43:21] !log begin rebooting lvs5004 (T335835) [14:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:48] (03CR) 10Clément Goubert: [C: 03+1] service::catalog: add mw-misc [puppet] - 10https://gerrit.wikimedia.org/r/941429 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [14:44:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:45:06] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [14:45:39] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [14:45:57] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/device-analytics: apply [14:46:06] (03CR) 10Ssingh: [C: 03+2] varnish: handle varnish-frontend-hospital crash with a (null) line [puppet] - 10https://gerrit.wikimedia.org/r/940989 (https://phabricator.wikimedia.org/T342566) (owner: 10Ssingh) [14:46:36] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply [14:48:34] PROBLEM - PyBal backends health check on lvs5004 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [14:49:04] PROBLEM - pybal on lvs5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [14:49:18] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:49:40] PROBLEM - PyBal connections to etcd on lvs5004 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=12) 
https://wikitech.wikimedia.org/wiki/PyBal [14:49:51] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10RobH) [14:49:58] 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10RobH) [14:50:55] jouncebot: nowandnext [14:50:56] No deployments scheduled for the next 1 hour(s) and 9 minute(s) [14:50:56] In 1 hour(s) and 9 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T1600) [14:51:44] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [core] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/941417 (https://phabricator.wikimedia.org/T335217) (owner: 10Zabe) [14:51:50] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [core] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941416 (https://phabricator.wikimedia.org/T335217) (owner: 10Zabe) [14:52:47] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=ats-be,name=cp2037.codfw.wmnet [14:54:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:54:50] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nat Hillard - https://phabricator.wikimedia.org/T342588 (10Isaac) FYI relevant past ticket on this particular set of permissions: T270438 [14:56:18] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:56:34] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring 
[14:57:38] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.288 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:57:52] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50276 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:58:32] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [14:58:36] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-main-codfw cluster: Roll restart of jvm daemons. [14:58:42] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [14:58:51] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [14:58:52] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-main-eqiad cluster: Roll restart of jvm daemons. [14:58:59] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [14:59:08] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [14:59:11] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Remove the openjdk images based on stretch [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/939256 (https://phabricator.wikimedia.org/T341115) (owner: 10Giuseppe Lavagetto) [14:59:18] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [14:59:26] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[14:59:26] (03PS2) 10Giuseppe Lavagetto: Remove the openjdk images based on stretch [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/939256 (https://phabricator.wikimedia.org/T341115) [14:59:32] (03CR) 10Giuseppe Lavagetto: [V: 03+2] Remove the openjdk images based on stretch [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/939256 (https://phabricator.wikimedia.org/T341115) (owner: 10Giuseppe Lavagetto) [15:01:08] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:08:01] (03Merged) 10jenkins-bot: Add namespace translations for Mandailing (btm) [core] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/941417 (https://phabricator.wikimedia.org/T335217) (owner: 10Zabe) [15:08:07] (03Merged) 10jenkins-bot: Add namespace translations for Mandailing (btm) [core] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941416 (https://phabricator.wikimedia.org/T335217) (owner: 10Zabe) [15:08:39] !log zabe@deploy1002 Started scap: Backport for [[gerrit:941417|Add namespace translations for Mandailing (btm) (T335217)]], [[gerrit:941416|Add namespace translations for Mandailing (btm) (T335217)]] [15:08:43] T335217: Add namespace translations in Mandailing - https://phabricator.wikimedia.org/T335217 [15:09:35] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[15:10:22] !log zabe@deploy1002 zabe: Backport for [[gerrit:941417|Add namespace translations for Mandailing (btm) (T335217)]], [[gerrit:941416|Add namespace translations for Mandailing (btm) (T335217)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [15:12:33] (03PS1) 10Hnowlan: trafficserver: route requests to proton via rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/941440 (https://phabricator.wikimedia.org/T324678) [15:13:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:14:51] <_joe_> !log removing all tags for docker image openjdk-8-jdk T341115 [15:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:55] T341115: Rationalize and update the use of base images in our docker-pkg repositories - https://phabricator.wikimedia.org/T341115 [15:16:31] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:941417|Add namespace translations for Mandailing (btm) (T335217)]], [[gerrit:941416|Add namespace translations for Mandailing (btm) (T335217)]] (duration: 07m 51s) [15:16:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:16:35] T335217: Add namespace translations in Mandailing - https://phabricator.wikimedia.org/T335217 [15:17:05] <_joe_> !log removing all tags for docker image openjdk-8-jre T341115 [15:17:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:34] (03CR) 10Cwhite: [C: 03+2] profile::logstash: allow more Istio ingress gateway logs [puppet] - 10https://gerrit.wikimedia.org/r/941434 (owner: 10Elukey) [15:21:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:21:40] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [15:22:16] (03PS2) 10Hashar: python-build: provide a python2 Bullseye image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/940161 (https://phabricator.wikimedia.org/T342346) [15:22:58] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [15:23:47] !log dancy@deploy1002 Started deploy [releng/jenkins-deploy@97b4674] (releasing): (no justification provided) [15:24:21] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [15:25:17] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[15:25:18] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs5004.eqsin.wmnet [15:25:32] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host lvs5004.eqsin.wmnet [15:25:45] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs5004.eqsin.wmnet [15:25:56] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (FY2022/2023-Q4): tcpircbot: enable logging to #wikimedia-cloud-feed - https://phabricator.wikimedia.org/T342666 (10fnegri) [15:26:15] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [15:26:16] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (FY2022/2023-Q4): tcpircbot: enable logging to #wikimedia-cloud-feed - https://phabricator.wikimedia.org/T342666 (10fnegri) 05Open→03In progress [15:26:24] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (FY2022/2023-Q4): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (10fnegri) [15:28:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:28:45] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs5004.eqsin.wmnet [15:29:10] PROBLEM - pybal on lvs5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:29:40] (03PS1) 10FNegri: tcpircbot: add another port for cloud IRC logging [puppet] - 10https://gerrit.wikimedia.org/r/941441 (https://phabricator.wikimedia.org/T342666) [15:30:10] PROBLEM - PyBal backends health check on lvs5004 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) 
https://wikitech.wikimedia.org/wiki/PyBal [15:33:03] (03PS3) 10Hashar: python-build: provide a python2 Bullseye image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/940161 (https://phabricator.wikimedia.org/T342346) [15:33:05] (03PS4) 10Hashar: python-build: set date of source files in the wheels [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/940157 (https://phabricator.wikimedia.org/T342346) [15:33:07] (03PS1) 10Hashar: Remove python3-build-jessie (Jessie is EOL) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/941442 [15:33:09] (03PS1) 10Hashar: python-build: ensure frozen-requirements is exhaustive [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/941443 (https://phabricator.wikimedia.org/T342346) [15:33:11] (03PS1) 10Hashar: python-build: rebuild images for recent changes [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/941444 (https://phabricator.wikimedia.org/T342346) [15:36:14] (03PS1) 10Arturo Borrero Gonzalez: eqiad1: cloudnet: enable cloud-private subnet [puppet] - 10https://gerrit.wikimedia.org/r/941445 (https://phabricator.wikimedia.org/T342619) [15:36:36] RECOVERY - pybal on lvs5004 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:36:38] !log lvs5004 restarted and services are reactivating (T335835) [15:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:32] RECOVERY - PyBal backends health check on lvs5004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:38:23] (03PS1) 10Cwhite: logstash: reroute istio-ingressgateway logs to webrequest partition [puppet] - 10https://gerrit.wikimedia.org/r/941053 [15:38:56] RECOVERY - PyBal connections to etcd on lvs5004 is OK: OK: 12 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [15:40:30] (03CR) 10Filippo Giunchedi: 
"Untested but LGTM, though I believe it might cause a restart of docker daemon due to systemd unit change. There's probably a way to craft " [puppet] - 10https://gerrit.wikimedia.org/r/941031 (owner: 10Andrew Bogott) [15:41:20] (03CR) 10Cwhite: [C: 03+2] logstash: reroute istio-ingressgateway logs to webrequest partition [puppet] - 10https://gerrit.wikimedia.org/r/941053 (owner: 10Cwhite) [15:43:43] (03PS1) 10Zabe: Initial configuration for btmwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941446 (https://phabricator.wikimedia.org/T335216) [15:43:48] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC: as expected https://puppet-compiler.wmflabs.org/output/941445/42693/" [puppet] - 10https://gerrit.wikimedia.org/r/941445 (https://phabricator.wikimedia.org/T342619) (owner: 10Arturo Borrero Gonzalez) [15:43:54] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+1] eqiad1: cloudnet: enable cloud-private subnet [puppet] - 10https://gerrit.wikimedia.org/r/941445 (https://phabricator.wikimedia.org/T342619) (owner: 10Arturo Borrero Gonzalez) [15:46:02] (03PS2) 10Zabe: Initial configuration for btmwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941446 (https://phabricator.wikimedia.org/T335216) [15:50:37] (03CR) 10Alexandros Kosiaris: [C: 04-1] "oh damn, the what: section of the commit message describes the old approach" [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [15:52:19] (03PS1) 10Giuseppe Lavagetto: varnish: add requestctl to X-analytics for static actions too [puppet] - 10https://gerrit.wikimedia.org/r/941448 (https://phabricator.wikimedia.org/T342577) [15:52:47] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Remove unused plain HTTP services from LVS - https://phabricator.wikimedia.org/T236065 (10taavi) [15:57:13] !log dancy@deploy1002 Finished deploy [releng/jenkins-deploy@97b4674] (releasing): (no justification provided) (duration: 33m 26s) 
[15:57:52] (03PS1) 10Ilias Sarantopoulos: httpbb: update ml-services tests [puppet] - 10https://gerrit.wikimedia.org/r/941449 (https://phabricator.wikimedia.org/T342266) [15:58:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:00:04] jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T1600). [16:00:04] James_F and dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:13] o/ [16:00:25] Heya. [16:00:46] I think mine were all deployed by now? [16:01:00] Oh, one of them was, one wasn't. [16:01:43] And for https://gerrit.wikimedia.org/r/c/operations/puppet/+/939757/ Alexandros said Traffic should merge, hmm. [16:03:40] taking a look 👋 [16:04:48] James_F: hm, yeah, I could push the button on that if absolutely necessary but I'd be more comfortable having a trafficologist on hand [16:05:03] I can look [16:05:24] amazing thank you [16:05:48] (03PS2) 10Ilias Sarantopoulos: ores-extension: enable lw on eswikiquotes and eswikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939697 (https://phabricator.wikimedia.org/T342115) [16:06:35] (03CR) 10Ssingh: [C: 03+1] Remove wikifunctions.org Varnish 302 [puppet] - 10https://gerrit.wikimedia.org/r/939757 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [16:08:11] rzl: James_F: ok to merge then? [16:08:39] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42694/console" [puppet] - 10https://gerrit.wikimedia.org/r/940406 (owner: 10Ahmon Dancy) [16:08:39] good by me! 
[16:08:52] once you're done I'll go ahead with dancy's two [16:08:55] James_F: deploying, since you asked intiially [16:08:58] rzl: noed [16:09:01] sigh, noted [16:09:03] where are the t's [16:09:13] (03CR) 10Ssingh: [C: 03+2] Remove wikifunctions.org Varnish 302 [puppet] - 10https://gerrit.wikimedia.org/r/939757 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [16:09:35] los bu no forgoen 😔 [16:09:49] haha [16:09:50] haha [16:11:44] all good, rolling out to the rest of the nodes [16:12:29] dancy: in the meantime, it doesn't look like there's any dependency or anything to test in between, I can just fire away with both, right? [16:12:43] Yes, they are unrelated changes. [16:12:55] 👍 [16:13:37] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:14:07] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:15:56] sukhe: Thabks! 
[16:16:12] James_F: done [16:16:39] thanks sukhe <3 going ahead with the others now [16:16:49] hanks [16:16:53] :) [16:16:59] haha [16:17:24] (03CR) 10RLazarus: [V: 03+1 C: 03+2] Remove unreferenced hiera data [puppet] - 10https://gerrit.wikimedia.org/r/940406 (owner: 10Ahmon Dancy) [16:18:10] (03CR) 10RLazarus: [C: 03+2] Scap: scap_source Use the "group" consistently [puppet] - 10https://gerrit.wikimedia.org/r/361796 (https://phabricator.wikimedia.org/T342320) (owner: 10Thcipriani) [16:18:18] (03Abandoned) 10Jforrester: [WIP] service, k8s: Add service definitions for function-orchestrator and function-evaluator [puppet] - 10https://gerrit.wikimedia.org/r/938295 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [16:19:07] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:19:12] !log begin rebooting lvs5005 (T335835) [16:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:33] (03PS1) 10Ssingh: dns6001: temporarily remove from authdns_servers for restart [puppet] - 10https://gerrit.wikimedia.org/r/941450 [16:21:16] dancy: puppet's finished on deploy1002 [16:21:33] Thanks! 
[16:22:02] (03CR) 10Ssingh: [C: 03+2] dns6001: temporarily remove from authdns_servers for restart [puppet] - 10https://gerrit.wikimedia.org/r/941450 (owner: 10Ssingh) [16:24:00] PROBLEM - pybal on lvs5005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [16:24:04] PROBLEM - PyBal backends health check on lvs5005 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [16:24:12] PROBLEM - PyBal connections to etcd on lvs5005 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [16:24:29] (03CR) 10BryanDavis: "Looks like it would work. Comments inline about how a one-time restart of existing services might be avoided." [puppet] - 10https://gerrit.wikimedia.org/r/941031 (owner: 10Andrew Bogott) [16:25:46] DNS alerts in drmrs also expected [16:25:54] er BGP alerts in drmrs because of DNS changes [16:26:27] (03CR) 10Jforrester: [C: 03+1] service::catalog: Add wikifunctions service [puppet] - 10https://gerrit.wikimedia.org/r/941313 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm) [16:26:36] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host dns6001.wikimedia.org [16:27:13] (03PS1) 10Ssingh: Revert "dns6001: temporarily remove from authdns_servers for restart" [puppet] - 10https://gerrit.wikimedia.org/r/941420 [16:28:01] 10SRE, 10Traffic: Perform katran load tests on lvs1013 - https://phabricator.wikimedia.org/T342618 (10Vgutierrez) [16:29:34] PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:29:42] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:30:35] !log sukhe@cumin2002 END 
(PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns6001.wikimedia.org [16:32:36] RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:32:44] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:33:27] 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10RobH) [16:33:44] 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10RobH) [16:34:33] (03CR) 10Ssingh: [C: 03+2] Revert "dns6001: temporarily remove from authdns_servers for restart" [puppet] - 10https://gerrit.wikimedia.org/r/941420 (owner: 10Ssingh) [16:35:11] 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be100[34] - https://phabricator.wikimedia.org/T342675 (10RobH) [16:35:27] 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be100[34] - https://phabricator.wikimedia.org/T342675 (10RobH) [16:35:36] (03CR) 10Andrew Bogott: docker service: support a list of arbitrary bind mounts (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/941031 (owner: 10Andrew Bogott) [16:35:38] (03PS2) 10Ayounsi: [WIP] Initial SONiC config from Homer YAML [homer/public] - 10https://gerrit.wikimedia.org/r/940867 (https://phabricator.wikimedia.org/T320638) [16:35:48] (03PS6) 10Andrew Bogott: docker service: support a list of arbitrary bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/941031 [16:35:50] (03PS21) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (https://phabricator.wikimedia.org/T341640) [16:36:01] (03PS3) 10Ayounsi: [WIP] Initial SONiC config from Homer YAML 
[homer/public] - 10https://gerrit.wikimedia.org/r/940867 (https://phabricator.wikimedia.org/T320638) [16:37:55] (03CR) 10Andrew Bogott: "Here's pcc output showing this as no longer modifying the docker line:" [puppet] - 10https://gerrit.wikimedia.org/r/941031 (owner: 10Andrew Bogott) [16:37:57] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs5005.eqsin.wmnet [16:39:20] (03CR) 10BryanDavis: [C: 03+1] "Diff on existing usage looks good to me. https://puppet-compiler.wmflabs.org/output/941031/42697/cloudweb1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/941031 (owner: 10Andrew Bogott) [16:39:54] (03PS1) 10Ssingh: dns6002: temporarily remove from authdns_servers for restart [puppet] - 10https://gerrit.wikimedia.org/r/941452 [16:40:02] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-main-eqiad cluster: Roll restart of jvm daemons. [16:40:21] (03CR) 10Hnowlan: [C: 03+2] images: fix debug logging for memcache [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/938272 (owner: 10Hnowlan) [16:40:44] (03CR) 10Ssingh: [C: 03+2] dns6002: temporarily remove from authdns_servers for restart [puppet] - 10https://gerrit.wikimedia.org/r/941452 (owner: 10Ssingh) [16:41:07] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs5005.eqsin.wmnet [16:41:35] (03PS22) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (https://phabricator.wikimedia.org/T341640) [16:41:37] !log end rebooting lvs5005 (T335835) [16:41:37] (03PS1) 10Andrew Bogott: DO NOT MERGE, this is just a proof of concept [puppet] - 10https://gerrit.wikimedia.org/r/941454 [16:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:50] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review, 10cloud-services-team (FY2022/2023-Q4): [spicerack] support including 
{project} in SAL messages - https://phabricator.wikimedia.org/T341793 (10fnegri) After discussing this with @Volans we think there's no need to modify the Spic... [16:42:10] RECOVERY - pybal on lvs5005 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [16:42:14] RECOVERY - PyBal backends health check on lvs5005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:43:51] (03PS4) 10Ayounsi: [WIP] Initial SONiC config from Homer YAML [homer/public] - 10https://gerrit.wikimedia.org/r/940867 (https://phabricator.wikimedia.org/T320638) [16:44:17] (03PS2) 10Andrew Bogott: DO NOT MERGE, this is just a proof of concept [puppet] - 10https://gerrit.wikimedia.org/r/941454 [16:44:19] (03PS23) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (https://phabricator.wikimedia.org/T341640) [16:44:27] (03Merged) 10jenkins-bot: images: fix debug logging for memcache [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/938272 (owner: 10Hnowlan) [16:45:06] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host dns6002.wikimedia.org [16:45:29] (03PS4) 10Ayounsi: Initial OpenConfig/SONiC support to wmf-netbox [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/940515 (https://phabricator.wikimedia.org/T320638) [16:46:02] (03CR) 10CI reject: [V: 04-1] Initial OpenConfig/SONiC support to wmf-netbox [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/940515 (https://phabricator.wikimedia.org/T320638) (owner: 10Ayounsi) [16:46:12] RECOVERY - PyBal connections to etcd on lvs5005 is OK: OK: 4 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [16:46:46] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:47:12] (03PS1) 10Ssingh: Revert "dns6002: temporarily remove from 
authdns_servers for restart" [puppet] - 10https://gerrit.wikimedia.org/r/941421 [16:47:20] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:47:52] 10SRE, 10Traffic: Perform katran load tests on lvs1013 - https://phabricator.wikimedia.org/T342618 (10Vgutierrez) [16:49:05] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns6002.wikimedia.org [16:49:44] (03CR) 10Ssingh: [C: 03+2] Revert "dns6002: temporarily remove from authdns_servers for restart" [puppet] - 10https://gerrit.wikimedia.org/r/941421 (owner: 10Ssingh) [16:50:07] (03CR) 10Andrew Bogott: "Here's a diff where it actually adds something:" [puppet] - 10https://gerrit.wikimedia.org/r/941031 (owner: 10Andrew Bogott) [16:50:36] (03CR) 10BryanDavis: [C: 03+1] Add perl536-sssd [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941401 (https://phabricator.wikimedia.org/T335507) (owner: 10Majavah) [16:50:36] (03PS1) 10Elukey: role::kafka::logging: apply threads settings to brokers [puppet] - 10https://gerrit.wikimedia.org/r/941455 [16:50:40] (03PS24) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (https://phabricator.wikimedia.org/T341640) [16:50:54] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (10Jhancock.wm) @aborrero I wanted to check with you about the cabling information on these servers. (planning out the racking ahe... 
[16:51:14] (03CR) 10Majavah: [C: 03+2] Add perl536-sssd [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941401 (https://phabricator.wikimedia.org/T335507) (owner: 10Majavah) [16:51:20] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:51:22] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [16:51:36] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [16:51:46] (03Merged) 10jenkins-bot: Add perl536-sssd [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941401 (https://phabricator.wikimedia.org/T335507) (owner: 10Majavah) [16:51:52] RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:52:06] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42703/console" [puppet] - 10https://gerrit.wikimedia.org/r/941455 (owner: 10Elukey) [16:54:39] (03CR) 10Cwhite: [C: 03+1] "LGTM, especially if we get the nice improvement 😊" [puppet] - 10https://gerrit.wikimedia.org/r/941455 (owner: 10Elukey) [16:56:36] !log dummy authdns-update [16:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:22] (03PS1) 10Dwisehaupt: Remove frav1002 for decom [dns] - 10https://gerrit.wikimedia.org/r/941457 (https://phabricator.wikimedia.org/T342678) [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T1700) [17:07:59] (03PS1) 10Majavah: conftool-data: Duplicate labweb service as cloudweb [puppet] - 10https://gerrit.wikimedia.org/r/941458 (https://phabricator.wikimedia.org/T317463) [17:08:02] (03PS1) 10Majavah: service: update labweb/cloudweb conftool pool name [puppet] - 
10https://gerrit.wikimedia.org/r/941459 (https://phabricator.wikimedia.org/T317463) [17:08:06] (03PS1) 10Majavah: conftool-data: drop labweb pool [puppet] - 10https://gerrit.wikimedia.org/r/941460 (https://phabricator.wikimedia.org/T317463) [17:09:00] (03CR) 10Jgreen: [C: 03+2] Remove frav1002 for decom [dns] - 10https://gerrit.wikimedia.org/r/941457 (https://phabricator.wikimedia.org/T342678) (owner: 10Dwisehaupt) [17:09:08] (03CR) 10Jgreen: [C: 03+2] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/941457 (https://phabricator.wikimedia.org/T342678) (owner: 10Dwisehaupt) [17:18:24] (03CR) 10Andrew Bogott: [C: 03+2] docker service: support a list of arbitrary bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/941031 (owner: 10Andrew Bogott) [17:36:17] (03PS1) 10Bernard Wang: Fix text showing on icon only buttons [skins/Vector] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941423 [17:39:22] (03CR) 10BCornwall: [V: 03+1 C: 03+2] Allow disabling puppet on reboot (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/939377 (https://phabricator.wikimedia.org/T342182) (owner: 10BCornwall) [17:41:32] (03PS1) 10Ladsgroup: beta: Stop writing to extlinks old columns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941463 (https://phabricator.wikimedia.org/T342683) [17:42:32] RECOVERY - Check systemd state on mx1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:42:54] (03PS1) 10Func: Revert "Adding Movepage-summary to wgForceUIMsgAsContentMsg to allow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941424 (https://phabricator.wikimedia.org/T183848) [17:43:05] (03CR) 10CI reject: [V: 04-1] Revert "Adding Movepage-summary to wgForceUIMsgAsContentMsg to allow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941424 (https://phabricator.wikimedia.org/T183848) (owner: 10Func) [17:43:18] (03PS2) 10Func: Revert "Adding Movepage-summary to wgForceUIMsgAsContentMsg to 
allow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941424 (https://phabricator.wikimedia.org/T183848) [17:44:22] RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:44:32] (03CR) 10Ladsgroup: [C: 03+2] beta: Stop writing to extlinks old columns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941463 (https://phabricator.wikimedia.org/T342683) (owner: 10Ladsgroup) [17:45:13] (03PS1) 10Andrew Bogott: keystone.conf: turn on character restrictions for new projects and domains. [puppet] - 10https://gerrit.wikimedia.org/r/941464 (https://phabricator.wikimedia.org/T341509) [17:45:16] (03PS1) 10Andrew Bogott: keystone: hack to reject all non-alphanumerical project or domain names [puppet] - 10https://gerrit.wikimedia.org/r/941465 (https://phabricator.wikimedia.org/T341509) [17:45:19] (03Merged) 10jenkins-bot: beta: Stop writing to extlinks old columns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941463 (https://phabricator.wikimedia.org/T342683) (owner: 10Ladsgroup) [17:45:22] (03PS3) 10Func: Revert "Adding Movepage-summary to wgForceUIMsgAsContentMsg to allow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941424 (https://phabricator.wikimedia.org/T183848) [17:45:46] (03CR) 10CI reject: [V: 04-1] keystone: hack to reject all non-alphanumerical project or domain names [puppet] - 10https://gerrit.wikimedia.org/r/941465 (https://phabricator.wikimedia.org/T341509) (owner: 10Andrew Bogott) [17:47:34] (03PS1) 10Ssingh: dns4004: temporarily remove from authdns_servers for restart [puppet] - 10https://gerrit.wikimedia.org/r/941466 [17:47:36] (03PS1) 10Ssingh: Revert "dns4004: temporarily remove from authdns_servers for restart" [puppet] - 10https://gerrit.wikimedia.org/r/941467 [17:47:50] (03PS2) 10Andrew Bogott: keystone: hack to reject all non-alphanumerical project or domain names [puppet] - 10https://gerrit.wikimedia.org/r/941465 
(https://phabricator.wikimedia.org/T341509) [17:48:37] (03CR) 10Ssingh: [C: 03+2] dns4004: temporarily remove from authdns_servers for restart [puppet] - 10https://gerrit.wikimedia.org/r/941466 (owner: 10Ssingh) [17:51:42] (03PS3) 10Andrew Bogott: keystone: hack to reject all new non-alphanumerical project or domain names [puppet] - 10https://gerrit.wikimedia.org/r/941465 (https://phabricator.wikimedia.org/T341509) [17:51:50] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:51:56] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:51:57] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host dns4004.wikimedia.org [17:53:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:55:22] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:55:34] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:56:50] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns4004.wikimedia.org [17:58:00] (03CR) 10Ssingh: [C: 03+2] Revert "dns4004: temporarily remove from authdns_servers for restart" [puppet] - 10https://gerrit.wikimedia.org/r/941467 (owner: 10Ssingh) [17:58:25] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:58:32] RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:59:54] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:00:05] jnuche and dancy: OwO what's this, a deployment window?? MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T1800). nyaa~ [18:00:51] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/941455 (owner: 10Elukey) [18:01:45] (03PS1) 10Ssingh: dns4003: temporarily remove from authdns_servers for restart [puppet] - 10https://gerrit.wikimedia.org/r/941469 [18:01:47] (03PS1) 10Ssingh: Revert "dns4003: temporarily remove from authdns_servers for restart" [puppet] - 10https://gerrit.wikimedia.org/r/941470 [18:02:16] (03CR) 10Ssingh: [C: 03+2] dns4003: temporarily remove from authdns_servers for restart [puppet] - 10https://gerrit.wikimedia.org/r/941469 (owner: 10Ssingh) [18:03:48] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:03:52] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:05:08] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50276 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:05:14] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.308 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:06:20] !log sukhe@cumin2002 START - 
Cookbook sre.hosts.reboot-single for host dns4003.wikimedia.org [18:06:54] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:07:02] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:08:58] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:09:08] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:11:41] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns4003.wikimedia.org [18:13:26] (03CR) 10Ssingh: [C: 03+2] Revert "dns4003: temporarily remove from authdns_servers for restart" [puppet] - 10https://gerrit.wikimedia.org/r/941470 (owner: 10Ssingh) [18:13:30] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:13:40] RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:21:01] !log dummy authdns-update returns [18:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:07] !log dwisehaupt@cumin1001 START - Cookbook sre.dns.netbox [18:21:27] oh interesting time [18:21:29] we will see how this plays out [18:21:36] oops. :) [18:21:41] haha all good, my bad too :) [18:21:55] mine is just a decommissioning for frav1002 [18:21:56] I was just doing a dummy run to make sure everything is fine with all hosts [18:22:04] dwisehaupt: all yours [18:22:09] please feel free to run it [18:22:14] if you see any issues, please ping [18:22:24] will do, it should be done soonish. 
[18:23:26] !log dwisehaupt@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: * - dwisehaupt@cumin1001" [18:24:13] !log dwisehaupt@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: * - dwisehaupt@cumin1001" [18:24:13] !log dwisehaupt@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:24:42] sukhe: all good and clean. [18:24:46] dwisehaupt: thanks! [18:32:12] 10ops-eqiad, 10decommission-hardware, 10fundraising-tech-ops: decommission frav1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T342678 (10Dwisehaupt) a:05Dwisehaupt→03Jclark-ctr Host powered off and ready for decom. [18:32:34] 10ops-eqiad, 10decommission-hardware, 10fundraising-tech-ops: decommission frav1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T342678 (10Dwisehaupt) [18:52:26] (03PS1) 10Krinkle: [BETA HACK] Make kafka_config default cluster logic actually work [puppet] - 10https://gerrit.wikimedia.org/r/941475 [18:52:28] (03PS1) 10Krinkle: [BETA HACK] Attempt to secure Puppet DB better [puppet] - 10https://gerrit.wikimedia.org/r/941476 [18:52:30] (03PS1) 10Krinkle: [BETA HACK] Allow external access from anywhere to parsoid port 80 for CI purposes [puppet] - 10https://gerrit.wikimedia.org/r/941477 [18:52:32] (03PS1) 10Krinkle: [BETA HACK] confd: Fix confd hostname [puppet] - 10https://gerrit.wikimedia.org/r/941478 [18:52:34] (03PS1) 10Krinkle: [BETA HACK] scap: foreachwikiindblist: always filter for all-labs [puppet] - 10https://gerrit.wikimedia.org/r/941479 [18:53:06] (03CR) 10CI reject: [V: 04-1] [BETA HACK] Attempt to secure Puppet DB better [puppet] - 10https://gerrit.wikimedia.org/r/941476 (owner: 10Krinkle) [18:55:29] (03PS1) 10Dwisehaupt: Remove frmon1001 and frmon2001 from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/941480 (https://phabricator.wikimedia.org/T342693) [18:55:33] (03CR) 
10CI reject: [V: 04-1] [BETA HACK] Allow external access from anywhere to parsoid port 80 for CI purposes [puppet] - 10https://gerrit.wikimedia.org/r/941477 (owner: 10Krinkle) [18:57:05] (03CR) 10Ssingh: "recheck" [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/941367 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur) [18:59:01] (03CR) 10Bking: [C: 03+2] Bump version of extra plugin [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/938210 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer) [19:06:50] (03PS1) 10Bking: Increment BUILD_VERSION so plugin can build [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/941483 (https://phabricator.wikimedia.org/T325315) [19:09:13] (03CR) 10Ryan Kemper: [C: 03+1] Increment BUILD_VERSION so plugin can build [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/941483 (https://phabricator.wikimedia.org/T325315) (owner: 10Bking) [19:09:34] (03CR) 10Bking: [C: 03+2] Increment BUILD_VERSION so plugin can build [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/941483 (https://phabricator.wikimedia.org/T325315) (owner: 10Bking) [19:12:01] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638 (10ayounsi) [19:12:18] (03PS2) 10Ayounsi: WIP: first scaffolding fo gNMI support [software/homer] - 10https://gerrit.wikimedia.org/r/939681 (https://phabricator.wikimedia.org/T320638) [19:12:33] (03PS3) 10Ayounsi: WIP: first scaffolding for gNMI support [software/homer] - 10https://gerrit.wikimedia.org/r/939681 (https://phabricator.wikimedia.org/T320638) [19:13:08] (03CR) 10Andrew Bogott: [C: 03+2] keystone.conf: turn on character restrictions for new projects and domains. 
[puppet] - 10https://gerrit.wikimedia.org/r/941464 (https://phabricator.wikimedia.org/T341509) (owner: 10Andrew Bogott) [19:13:18] (03CR) 10Andrew Bogott: [C: 03+2] keystone: hack to reject all new non-alphanumerical project or domain names [puppet] - 10https://gerrit.wikimedia.org/r/941465 (https://phabricator.wikimedia.org/T341509) (owner: 10Andrew Bogott) [19:13:50] (03PS10) 10Jforrester: [DNM] Add wikifunctions.org to prod wgLocalVirtualHosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771623 (https://phabricator.wikimedia.org/T275945) [19:13:52] (03PS13) 10Jforrester: [DNM] Initial configuration for Wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) [19:13:54] (03CR) 10Jforrester: [DNM] Initial configuration for Wikifunctions.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester) [19:13:56] (03PS7) 10Jforrester: Add wikifunctions.org to foundationwiki's custom CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771624 [19:13:58] (03PS5) 10Jforrester: [Beta Cluster] Drop duplicate settings now Wikifunctions.org exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934632 [19:14:00] (03PS11) 10Jforrester: Let wikifunctions.org use the Graph system [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740795 [19:14:18] (03CR) 10CI reject: [V: 04-1] WIP: first scaffolding for gNMI support [software/homer] - 10https://gerrit.wikimedia.org/r/939681 (https://phabricator.wikimedia.org/T320638) (owner: 10Ayounsi) [19:15:36] (03PS1) 10Andrew Bogott: keystone: fix name of patch file [puppet] - 10https://gerrit.wikimedia.org/r/941484 [19:15:49] (03CR) 10CI reject: [V: 04-1] keystone: fix name of patch file [puppet] - 10https://gerrit.wikimedia.org/r/941484 (owner: 10Andrew Bogott) [19:16:18] (03PS2) 10Andrew Bogott: keystone: fix name of patch file [puppet] - 
10https://gerrit.wikimedia.org/r/941484 [19:17:49] (03CR) 10Andrew Bogott: [C: 03+2] keystone: fix name of patch file [puppet] - 10https://gerrit.wikimedia.org/r/941484 (owner: 10Andrew Bogott) [19:26:27] (03CR) 10Jgreen: [C: 03+1] "Looks good to me, ready for SRE to merge & deploy!" [puppet] - 10https://gerrit.wikimedia.org/r/941480 (https://phabricator.wikimedia.org/T342693) (owner: 10Dwisehaupt) [19:34:53] (03PS25) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (https://phabricator.wikimedia.org/T341640) [19:37:20] (03CR) 10Jforrester: [C: 03+1] "Can this please be deployed so we can call it? :-) We're trying to go live tomorrow at 16:00 UTC, so it'd be great to have this in place a" [puppet] - 10https://gerrit.wikimedia.org/r/941313 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm) [19:38:38] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (https://phabricator.wikimedia.org/T341640) (owner: 10Andrew Bogott) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T2000). [20:00:05] bwang: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:21] Hello. 
I'm subbing for bwang [20:01:24] I cannot deploy this evening [20:01:29] tyty [20:01:46] I can deploy [20:02:08] taavi: thanks [20:03:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941423 (owner: 10Bernard Wang) [20:04:33] and now we wait for CI [20:10:02] (03PS1) 10Andrew Bogott: horizon/docker: fix (maybe) the namespace and image name [puppet] - 10https://gerrit.wikimedia.org/r/941512 [20:15:48] (03PS1) 10Tsevener: Add stream config for iOS schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941514 (https://phabricator.wikimedia.org/T341896) [20:18:12] (03PS14) 10Jforrester: [DNM] Initial configuration for Wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) [20:18:14] (03PS8) 10Jforrester: Add wikifunctions.org to foundationwiki's custom CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771624 [20:18:16] (03PS6) 10Jforrester: [Beta Cluster] Drop duplicate settings now Wikifunctions.org exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934632 [20:18:18] (03PS12) 10Jforrester: Let wikifunctions.org use the Graph system [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740795 [20:18:20] (03PS1) 10Jforrester: [DNM] Move wikifunctions.org from locked-down to limited deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941515 [20:21:14] (03Merged) 10jenkins-bot: Fix text showing on icon only buttons [skins/Vector] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941423 (owner: 10Bernard Wang) [20:21:42] !log taavi@deploy1002 Started scap: Backport for [[gerrit:941423|Fix text showing on icon only buttons]] [20:23:17] !log taavi@deploy1002 taavi and bwang: Backport for [[gerrit:941423|Fix text showing on icon only buttons]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, 
mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:23:19] (03CR) 10Andrew Bogott: [C: 03+2] horizon/docker: fix (maybe) the namespace and image name [puppet] - 10https://gerrit.wikimedia.org/r/941512 (owner: 10Andrew Bogott) [20:23:24] kimberly_sarabia: please test [20:23:40] taavi: ack [20:26:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:27:31] taavi: LGTM [20:27:46] thanks, syncing [20:27:55] (03PS3) 10Zabe: Initial configuration for btmwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941446 (https://phabricator.wikimedia.org/T335216) [20:31:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:32:46] (03PS1) 10Andrew Bogott: Horizon/docker: fix bind mount typo [puppet] - 10https://gerrit.wikimedia.org/r/941518 (https://phabricator.wikimedia.org/T341640) [20:33:50] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:941423|Fix text showing on icon only buttons]] (duration: 12m 08s) [20:33:55] all done! [20:34:07] anyone have anything else to deploy? 
[20:35:32] (03CR) 10Andrew Bogott: [C: 03+2] Horizon/docker: fix bind mount typo [puppet] - 10https://gerrit.wikimedia.org/r/941518 (https://phabricator.wikimedia.org/T341640) (owner: 10Andrew Bogott) [20:40:05] o/ [20:40:21] (03CR) 10Zabe: [C: 03+2] Initial configuration for btmwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941446 (https://phabricator.wikimedia.org/T335216) (owner: 10Zabe) [20:41:01] (03Merged) 10jenkins-bot: Initial configuration for btmwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941446 (https://phabricator.wikimedia.org/T335216) (owner: 10Zabe) [20:42:55] !log create Wiktionary Mandailing # T335216 [20:42:58] zabe: <3 [20:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:59] T335216: Create Wiktionary Mandailing - https://phabricator.wikimedia.org/T335216 [20:43:08] I'm around, let me know if things go weeeeee [20:43:17] addwiki ran through without errors [20:43:30] Wohoooo [20:43:56] just in time for wikifunctions :D [20:44:47] !log zabe@deploy1002 Started scap: T335216 [20:46:25] !log zabe@deploy1002 zabe: T335216 synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:48:23] (03PS1) 10Zabe: Create UserIdentityValue with correct wiki [extensions/CheckUser] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941498 (https://phabricator.wikimedia.org/T342655) [20:48:33] (03PS1) 10Zabe: Create UserIdentityValue with correct wiki [extensions/CheckUser] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/941499 (https://phabricator.wikimedia.org/T342655) [20:48:46] (03CR) 10Zabe: [C: 03+2] Create UserIdentityValue with correct wiki [extensions/CheckUser] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/941499 (https://phabricator.wikimedia.org/T342655) (owner: 10Zabe) [20:48:54] (03CR) 10Zabe: [C: 03+2] 
Create UserIdentityValue with correct wiki [extensions/CheckUser] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941498 (https://phabricator.wikimedia.org/T342655) (owner: 10Zabe) [20:50:55] zabe: Thank you! [20:51:12] Of course, we need the service to go live first. :-) [20:52:28] yw :) [20:52:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:53:11] !log zabe@deploy1002 Finished scap: T335216 (duration: 08m 24s) [20:53:16] T335216: Create Wiktionary Mandailing - https://phabricator.wikimedia.org/T335216 [20:54:36] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941057 [20:54:38] (03CR) 10Zabe: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941057 (owner: 10Zabe) [20:55:12] !log zabe@deploy1002 Started scap: update interwiki cache, [[gerrit:941057]] [20:55:32] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941057 (owner: 10Zabe) [20:57:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:58:13] (03PS1) 10Andrew Bogott: horizon/docker: move to port 8084 [puppet] - 10https://gerrit.wikimedia.org/r/941521 (https://phabricator.wikimedia.org/T341640) [21:01:44] (03CR) 10Andrew Bogott: [C: 03+2] horizon/docker: move to port 8084 [puppet] - 10https://gerrit.wikimedia.org/r/941521 (https://phabricator.wikimedia.org/T341640) (owner: 10Andrew Bogott) [21:02:33] !log zabe@deploy1002 Finished scap: update interwiki cache, [[gerrit:941057]] (duration: 07m 20s) [21:02:34] 
(KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:05:49] (03Merged) 10jenkins-bot: Create UserIdentityValue with correct wiki [extensions/CheckUser] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/941499 (https://phabricator.wikimedia.org/T342655) (owner: 10Zabe) [21:07:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:07:46] (03Merged) 10jenkins-bot: Create UserIdentityValue with correct wiki [extensions/CheckUser] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941498 (https://phabricator.wikimedia.org/T342655) (owner: 10Zabe) [21:08:19] !log zabe@deploy1002 Started scap: Backport for [[gerrit:941498|Create UserIdentityValue with correct wiki (T342655)]], [[gerrit:941499|Create UserIdentityValue with correct wiki (T342655)]] [21:08:23] T342655: Special:Investigate: Wikimedia\Assert\PreconditionException: Expected MediaWiki\User\UserIdentityValue to belong to 'afwiki', but it belongs to the local wiki - https://phabricator.wikimedia.org/T342655 [21:10:01] !log zabe@deploy1002 zabe: Backport for [[gerrit:941498|Create UserIdentityValue with correct wiki (T342655)]], [[gerrit:941499|Create UserIdentityValue with correct wiki (T342655)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [21:18:26] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:941498|Create UserIdentityValue with correct wiki (T342655)]], [[gerrit:941499|Create 
UserIdentityValue with correct wiki (T342655)]] (duration: 10m 06s) [21:18:30] T342655: Special:Investigate: Wikimedia\Assert\PreconditionException: Expected MediaWiki\User\UserIdentityValue to belong to 'afwiki', but it belongs to the local wiki - https://phabricator.wikimedia.org/T342655 [21:32:49] (03PS1) 10Andrew Bogott: horizon: use the in-container path for static resources in codfw [puppet] - 10https://gerrit.wikimedia.org/r/941527 (https://phabricator.wikimedia.org/T341640) [21:35:06] (03CR) 10Andrew Bogott: [C: 03+2] horizon: use the in-container path for static resources in codfw [puppet] - 10https://gerrit.wikimedia.org/r/941527 (https://phabricator.wikimedia.org/T341640) (owner: 10Andrew Bogott) [21:43:29] (03PS1) 10Andrew Bogott: Horizon/docker: another move from 8081 to 8084 [puppet] - 10https://gerrit.wikimedia.org/r/941528 (https://phabricator.wikimedia.org/T341640) [21:46:18] (03CR) 10Andrew Bogott: [C: 03+2] Horizon/docker: another move from 8081 to 8084 [puppet] - 10https://gerrit.wikimedia.org/r/941528 (https://phabricator.wikimedia.org/T341640) (owner: 10Andrew Bogott) [21:47:04] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for adding the test too!" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/940968 (https://phabricator.wikimedia.org/T341793) (owner: 10FNegri) [21:51:59] (03CR) 10Volans: [C: 04-1] "Minor errors inside, LGTM otherwise." 
[puppet] - 10https://gerrit.wikimedia.org/r/941441 (https://phabricator.wikimedia.org/T342666) (owner: 10FNegri) [21:54:26] (03PS3) 10Bking: flink-zk: Initiate new flink::zookeeper role [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) [21:54:51] (03CR) 10CI reject: [V: 04-1] flink-zk: Initiate new flink::zookeeper role [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [21:55:50] (03PS1) 10Andrew Bogott: Horizon/docker: yet more moves from 8081 to 8084 [puppet] - 10https://gerrit.wikimedia.org/r/941529 (https://phabricator.wikimedia.org/T341640) [21:56:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:56:35] (03CR) 10Volans: "Reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/939377 (https://phabricator.wikimedia.org/T342182) (owner: 10BCornwall) [21:57:40] (03CR) 10Andrew Bogott: [C: 03+2] Horizon/docker: yet more moves from 8081 to 8084 [puppet] - 10https://gerrit.wikimedia.org/r/941529 (https://phabricator.wikimedia.org/T341640) (owner: 10Andrew Bogott) [21:59:54] (03PS4) 10Bking: flink-zk: Initiate new flink::zookeeper role [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) [22:00:18] (03CR) 10CI reject: [V: 04-1] flink-zk: Initiate new flink::zookeeper role [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [22:01:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:08:28] (03PS5) 10Bking: flink-zk: Initiate new flink::zookeeper role [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) [22:08:52] (03CR) 10CI reject: [V: 04-1] flink-zk: Initiate new flink::zookeeper role [puppet] - 10https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [22:21:19] (03CR) 10Cwhite: [C: 03+2] logstash: remove haproxy log cloning [puppet] - 10https://gerrit.wikimedia.org/r/937601 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [22:23:01] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/940879 (https://phabricator.wikimedia.org/T108027) (owner: 10Filippo Giunchedi) [23:15:30] (03PS1) 10Ladsgroup: Replace the look with Wikimedia UI [software/bitu] - 10https://gerrit.wikimedia.org/r/941535 [23:27:16] (03PS2) 10Ladsgroup: Replace the look with Wikimedia UI [software/bitu] - 10https://gerrit.wikimedia.org/r/941535 [23:46:44] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: adds-changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state