[00:19:02] <icinga-wm>	 PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:20] <icinga-wm>	 RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:40] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host rdb1013.eqiad.wmnet with OS bullseye
[00:34:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host rdb1013.eqiad.wmnet with OS bullseye
[00:38:35] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/940225
[00:38:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/940225 (owner: 10TrainBranchBot)
[00:55:11] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/940225 (owner: 10TrainBranchBot)
[01:03:34] <wikibugs>	 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T342592 (10phaultfinder)
[01:16:09] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host rdb1013.eqiad.wmnet with OS bullseye
[01:16:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host rdb1013.eqiad.wmnet with OS bullseye executed with errors: - rdb1013 (**FAIL**)   - Rem...
[01:17:32] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host rdb1014.eqiad.wmnet with OS bullseye
[01:17:40] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host rdb1014.eqiad.wmnet with OS bullseye
[01:30:10] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host rdb1014.eqiad.wmnet with OS bullseye
[01:30:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host rdb1014.eqiad.wmnet with OS bullseye executed with errors: - rdb1014 (**FAIL**)   - Rem...
[01:34:46] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:45:24] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:19] <wikibugs>	 10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10TomerLerner) Thank you @akosiaris  We can only run client requests in the production URL, I guess it'll do for now until we com...
[02:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T0200)
[02:03:48] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: generate_os_reports.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:37] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/1.41.0-wmf.19 [core] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941047 (https://phabricator.wikimedia.org/T340247)
[02:07:43] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.41.0-wmf.19 [core] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941047 (https://phabricator.wikimedia.org/T340247) (owner: 10TrainBranchBot)
[02:18:32] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:18:34] <wikibugs>	 (03PS10) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992
[02:18:36] <wikibugs>	 (03PS1) 10Andrew Bogott: docker-service-shim.erb: support a list of arbitrary bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/941031
[02:18:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (owner: 10Andrew Bogott)
[02:19:02] <icinga-wm>	 PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:22:54] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/1.41.0-wmf.19 [core] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941047 (https://phabricator.wikimedia.org/T340247) (owner: 10TrainBranchBot)
[02:24:19] <wikibugs>	 (03PS11) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992
[02:24:43] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (owner: 10Andrew Bogott)
[02:31:08] <icinga-wm>	 RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:59] <wikibugs>	 (03PS12) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992
[02:32:22] <wikibugs>	 (03CR) 10jenkins-bot: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (owner: 10Andrew Bogott)
[02:32:32] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:35:51] <wikibugs>	 (03PS13) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992
[02:36:15] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (owner: 10Andrew Bogott)
[02:38:12] <wikibugs>	 (03PS14) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992
[02:38:35] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (owner: 10Andrew Bogott)
[02:41:46] <wikibugs>	 (03PS2) 10Andrew Bogott: docker-service-shim.erb: support a list of arbitrary bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/941031
[02:41:48] <wikibugs>	 (03PS15) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992
[02:42:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (owner: 10Andrew Bogott)
[02:45:40] <icinga-wm>	 RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:52:29] <wikibugs>	 (03PS3) 10Andrew Bogott: docker-service-shim.erb: support a list of arbitrary bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/941031
[02:52:31] <wikibugs>	 (03PS16) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992
[02:52:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (owner: 10Andrew Bogott)
[02:57:02] <wikibugs>	 (03PS4) 10Andrew Bogott: docker service: support a list of arbitrary bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/941031
[02:57:04] <wikibugs>	 (03PS17) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (https://phabricator.wikimedia.org/T341640)
[02:57:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (https://phabricator.wikimedia.org/T341640) (owner: 10Andrew Bogott)
[02:59:31] <wikibugs>	 (03PS1) 10Andrew Bogott: just to test the compiler... [puppet] - 10https://gerrit.wikimedia.org/r/941034
[03:00:06] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T0300)
[03:04:33] <wikibugs>	 (03PS5) 10Andrew Bogott: docker service: support a list of arbitrary bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/941031
[03:04:35] <wikibugs>	 (03PS18) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (https://phabricator.wikimedia.org/T341640)
[03:04:37] <wikibugs>	 (03PS2) 10Andrew Bogott: just to test the compiler... [puppet] - 10https://gerrit.wikimedia.org/r/941034
[03:04:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (https://phabricator.wikimedia.org/T341640) (owner: 10Andrew Bogott)
[03:08:48] <wikibugs>	 (03PS19) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (https://phabricator.wikimedia.org/T341640)
[03:11:24] <wikibugs>	 (03Abandoned) 10Andrew Bogott: just to test the compiler... [puppet] - 10https://gerrit.wikimedia.org/r/941034 (owner: 10Andrew Bogott)
[03:13:35] <wikibugs>	 (03PS20) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (https://phabricator.wikimedia.org/T341640)
[04:44:49] <wikibugs>	 (03PS1) 10Ryan Kemper: decom wdqs200[4-6] [puppet] - 10https://gerrit.wikimedia.org/r/941037 (https://phabricator.wikimedia.org/T342035)
[04:56:39] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.decommission for hosts wdqs[2004-2006].codfw.wmnet
[04:57:18] <wikibugs>	 (03PS2) 10Ryan Kemper: decom wdqs200[4-6] [puppet] - 10https://gerrit.wikimedia.org/r/941037 (https://phabricator.wikimedia.org/T342035)
[05:08:08] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] decom wdqs200[4-6] [puppet] - 10https://gerrit.wikimedia.org/r/941037 (https://phabricator.wikimedia.org/T342035) (owner: 10Ryan Kemper)
[05:08:32] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts wdqs[2004-2006].codfw.wmnet
[05:09:01] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.decommission for hosts wdqs[2004-2006].codfw.wmnet
[05:21:42] <wikibugs>	 (03CR) 10Ryan Kemper: "I'll take a note for Brian/I to get this reviewed and deployed this week." [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/938210 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer)
[05:27:50] <wikibugs>	 10ops-codfw, 10decommission-hardware: decommission wdqs200[4-6] - https://phabricator.wikimedia.org/T342600 (10RKemper)
[05:46:17] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.dns.netbox
[05:52:29] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs[2004-2006].codfw.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin1001"
[06:00:06] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T0600)
[06:00:06] <jouncebot>	 kormat, marostegui, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T0600).
[06:10:38] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs[2004-2006].codfw.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin1001"
[06:10:38] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:10:39] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wdqs[2004-2006].codfw.wmnet
[06:12:46] <wikibugs>	 (03PS1) 10Marostegui: db1213: Migrate to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/941040 (https://phabricator.wikimedia.org/T334650)
[06:13:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1213 (s5, s6)', diff saved to https://phabricator.wikimedia.org/P49680 and previous config saved to /var/cache/conftool/dbconfig/20230725-061319-root.json
[06:14:49] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1213: Migrate to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/941040 (https://phabricator.wikimedia.org/T334650) (owner: 10Marostegui)
[06:17:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49681 and previous config saved to /var/cache/conftool/dbconfig/20230725-061742-root.json
[06:17:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49682 and previous config saved to /var/cache/conftool/dbconfig/20230725-061753-root.json
[06:21:14] <wikibugs>	 (03PS1) 10Marostegui: pc1015,pc1016: New hosts to be set up [puppet] - 10https://gerrit.wikimedia.org/r/941042 (https://phabricator.wikimedia.org/T342164)
[06:21:53] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] pc1015,pc1016: New hosts to be set up [puppet] - 10https://gerrit.wikimedia.org/r/941042 (https://phabricator.wikimedia.org/T342164) (owner: 10Marostegui)
[06:32:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49683 and previous config saved to /var/cache/conftool/dbconfig/20230725-063247-root.json
[06:32:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49684 and previous config saved to /var/cache/conftool/dbconfig/20230725-063258-root.json
[06:33:51] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] "Two open comments from me on PS9" [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm)
[06:47:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49685 and previous config saved to /var/cache/conftool/dbconfig/20230725-064751-root.json
[06:48:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49686 and previous config saved to /var/cache/conftool/dbconfig/20230725-064802-root.json
[07:00:06] <jouncebot>	 Amir1, Urbanecm, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T0700).
[07:00:06] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:02:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49687 and previous config saved to /var/cache/conftool/dbconfig/20230725-070256-root.json
[07:03:02] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 131 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:03:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49688 and previous config saved to /var/cache/conftool/dbconfig/20230725-070307-root.json
[07:08:56] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] apache: Enable view_urls on wikifunctions.org [puppet] - 10https://gerrit.wikimedia.org/r/940246 (https://phabricator.wikimedia.org/T338190) (owner: 10Jforrester)
[07:18:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49689 and previous config saved to /var/cache/conftool/dbconfig/20230725-071801-root.json
[07:18:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49690 and previous config saved to /var/cache/conftool/dbconfig/20230725-071812-root.json
[07:19:58] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:30:28] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:33:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49691 and previous config saved to /var/cache/conftool/dbconfig/20230725-073305-root.json
[07:33:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49692 and previous config saved to /var/cache/conftool/dbconfig/20230725-073317-root.json
[07:40:06] <wikibugs>	 (03PS1) 10JMeybohm: wmnet: Add cnames for'wikifunctions ingress [dns] - 10https://gerrit.wikimedia.org/r/941312 (https://phabricator.wikimedia.org/T297314)
[07:40:28] <wikibugs>	 (03PS2) 10JMeybohm: wmnet: Add cnames for wikifunctions ingress [dns] - 10https://gerrit.wikimedia.org/r/941312 (https://phabricator.wikimedia.org/T297314)
[07:48:02] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[07:48:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49693 and previous config saved to /var/cache/conftool/dbconfig/20230725-074810-root.json
[07:48:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49694 and previous config saved to /var/cache/conftool/dbconfig/20230725-074821-root.json
[07:51:12] <wikibugs>	 (03PS1) 10JMeybohm: service::catalog: Add wikifunctions service [puppet] - 10https://gerrit.wikimedia.org/r/941313 (https://phabricator.wikimedia.org/T297314)
[07:51:14] <wikibugs>	 (03PS1) 10JMeybohm: service::catalog: Switch wikifunctions to state production [puppet] - 10https://gerrit.wikimedia.org/r/941314 (https://phabricator.wikimedia.org/T297314)
[07:51:52] <wikibugs>	 (03PS1) 10Elukey: role::kafka::main: increase worker threads for kafka-main1001 [puppet] - 10https://gerrit.wikimedia.org/r/941315 (https://phabricator.wikimedia.org/T341558)
[07:53:52] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42684/console" [puppet] - 10https://gerrit.wikimedia.org/r/941315 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey)
[07:55:56] <wikibugs>	 (03PS1) 10TrainBranchBot: testwikis wikis to 1.41.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941316 (https://phabricator.wikimedia.org/T340247)
[07:55:58] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.41.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941316 (https://phabricator.wikimedia.org/T340247) (owner: 10TrainBranchBot)
[07:56:38] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis wikis to 1.41.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941316 (https://phabricator.wikimedia.org/T340247) (owner: 10TrainBranchBot)
[07:57:04] <logmsgbot>	 !log jnuche@deploy1002 Started scap: testwikis wikis to 1.41.0-wmf.19  refs T340247
[07:57:08] <stashbot>	 T340247: 1.41.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T340247
[08:00:05] <jouncebot>	 jnuche and dancy: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T0800).
[08:01:04] <jnuche>	 morning, the train pre-sync failed last night
[08:01:26] <jnuche>	 I think I've fixed the issue (rebased sec patch needed to be applied) and I'm rerunning the pre-sync now
[08:03:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3315 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49695 and previous config saved to /var/cache/conftool/dbconfig/20230725-080315-root.json
[08:03:20] <icinga-wm>	 PROBLEM - SSH on wdqs1013 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:03:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1213:3316 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49696 and previous config saved to /var/cache/conftool/dbconfig/20230725-080326-root.json
[08:03:58] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:04:48] <icinga-wm>	 RECOVERY - SSH on wdqs1013 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:09:18] <icinga-wm>	 PROBLEM - SSH on wdqs1013 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:14:22] <icinga-wm>	 PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[08:15:12] <icinga-wm>	 RECOVERY - SSH on wdqs1013 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:22:36] <icinga-wm>	 PROBLEM - SSH on wdqs1013 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:24:18] <wikibugs>	 (03CR) 10Vgutierrez: "looks good but I think we should consider performance as well:" [puppet] - 10https://gerrit.wikimedia.org/r/940989 (https://phabricator.wikimedia.org/T342566) (owner: 10Ssingh)
[08:24:56] <icinga-wm>	 RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[08:25:42] <icinga-wm>	 RECOVERY - SSH on wdqs1013 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:26:28] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: add ingress support [deployment-charts] - 10https://gerrit.wikimedia.org/r/940189 (https://phabricator.wikimedia.org/T342356) (owner: 10Giuseppe Lavagetto)
[08:27:20] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: add ingress support [deployment-charts] - 10https://gerrit.wikimedia.org/r/940189 (https://phabricator.wikimedia.org/T342356) (owner: 10Giuseppe Lavagetto)
[08:30:18] <icinga-wm>	 PROBLEM - SSH on wdqs1013 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:30:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:31:47] <wikibugs>	 (03PS2) 10Elukey: role::kafka::main: increase worker threads for kafka-main1001 [puppet] - 10https://gerrit.wikimedia.org/r/941315 (https://phabricator.wikimedia.org/T341558)
[08:33:02] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42685/console" [puppet] - 10https://gerrit.wikimedia.org/r/941315 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey)
[08:33:42] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] role::kafka::main: increase worker threads for kafka-main1001 [puppet] - 10https://gerrit.wikimedia.org/r/941315 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey)
[08:34:29] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] role::kafka::main: increase worker threads for kafka-main1001 [puppet] - 10https://gerrit.wikimedia.org/r/941315 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey)
[08:35:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:35:36] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on kafka-main1001.eqiad.wmnet with reason: Apply a new setting to the Kafka broker
[08:35:50] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on kafka-main1001.eqiad.wmnet with reason: Apply a new setting to the Kafka broker
[08:40:17] <wikibugs>	 (03PS1) 10Jelto: idp: remove nda from required_groups for gitlab_replica_oidc [puppet] - 10https://gerrit.wikimedia.org/r/941319 (https://phabricator.wikimedia.org/T320390)
[08:41:40] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+1] "I doubt that makes much of a difference, but I see no reason to not give it a try." [puppet] - 10https://gerrit.wikimedia.org/r/941319 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto)
[08:43:33] <wikibugs>	 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10SLyngshede-WMF) I did a diff of the configurations for idp and idp-test, and they are basically the same, none of the settings tha...
[08:49:38] <logmsgbot>	 !log jnuche@deploy1002 Finished scap: testwikis wikis to 1.41.0-wmf.19  refs T340247 (duration: 52m 35s)
[08:49:43] <stashbot>	 T340247: 1.41.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T340247
[08:49:49] <wikibugs>	 (03PS1) 10Elukey: profile::kafka::broker: fix settings passed to the confluent class [puppet] - 10https://gerrit.wikimedia.org/r/941362 (https://phabricator.wikimedia.org/T341558)
[08:51:36] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42686/console" [puppet] - 10https://gerrit.wikimedia.org/r/941362 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey)
[08:51:52] <logmsgbot>	 !log jnuche@deploy1002 Pruned MediaWiki: 1.41.0-wmf.17 (duration: 02m 11s)
[08:52:21] <wikibugs>	 (03PS1) 10Elukey: Revert "role::kafka::main: increase worker threads for kafka-main1001" [puppet] - 10https://gerrit.wikimedia.org/r/940920
[08:52:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "role::kafka::main: increase worker threads for kafka-main1001" [puppet] - 10https://gerrit.wikimedia.org/r/940920 (owner: 10Elukey)
[08:53:43] <jnuche>	 pre-sync done, deploying train to group0 now
[08:53:53] <wikibugs>	 (03PS2) 10Elukey: Revert "role::kafka::main: increase worker threads for kafka-main1001" [puppet] - 10https://gerrit.wikimedia.org/r/940920
[08:54:05] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941365 (https://phabricator.wikimedia.org/T340247)
[08:54:07] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941365 (https://phabricator.wikimedia.org/T340247) (owner: 10TrainBranchBot)
[08:54:48] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941365 (https://phabricator.wikimedia.org/T340247) (owner: 10TrainBranchBot)
[08:56:52] <wikibugs>	 (03Abandoned) 10Elukey: Revert "role::kafka::main: increase worker threads for kafka-main1001" [puppet] - 10https://gerrit.wikimedia.org/r/940920 (owner: 10Elukey)
[08:57:26] <wikibugs>	 (03PS13) 10Alexandros Kosiaris: Kubernetes: add support for deployment apparmor profiles [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm)
[08:58:52] <wikibugs>	 (03PS2) 10Elukey: profile::kafka::broker: fix settings passed to the confluent class [puppet] - 10https://gerrit.wikimedia.org/r/941362 (https://phabricator.wikimedia.org/T341558)
[08:59:03] <_joe_>	 jouncebot: next
[08:59:03] <jouncebot>	 In 1 hour(s) and 0 minute(s): MediaWiki-related infrastructural changes (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T1000)
[08:59:25] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[08:59:28] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[08:59:43] <wikibugs>	 (03PS1) 10Fabfur: Version 6.0.11-1wm2 for Debian Bookworm [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/941367 (https://phabricator.wikimedia.org/T321309)
[09:00:11] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42687/console" [puppet] - 10https://gerrit.wikimedia.org/r/941362 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey)
[09:00:22] <wikibugs>	 (03PS1) 10Jcrespo: bacula: Increase the number of max volumes for production pool [puppet] - 10https://gerrit.wikimedia.org/r/941368
[09:01:32] <logmsgbot>	 !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.19  refs T340247
[09:01:36] <stashbot>	 T340247: 1.41.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T340247
[09:01:42] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42688/console" [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm)
[09:02:53] <wikibugs>	 (03PS2) 10Fabfur: Version 6.0.11-1wm2 for Debian Bookworm [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/941367 (https://phabricator.wikimedia.org/T342154)
[09:03:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:06:31] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:06:48] <slyngs>	 !log Restart Tomcat / Apereo CAS on idp1002
[09:06:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:10:05] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Version 6.0.11-1wm2 for Debian Bookworm [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/941367 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur)
[09:10:27] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/941362 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey)
[09:12:08] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "We could also discuss not rolling the "changes" out at all to jumbo and logging if we don't have issues there..." [puppet] - 10https://gerrit.wikimedia.org/r/941362 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey)
[09:14:05] <icinga-wm>	 RECOVERY - SSH on wdqs1013 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:24:29] <icinga-wm>	 PROBLEM - SSH on wdqs1013 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:30:03] <icinga-wm>	 RECOVERY - SSH on wdqs1013 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:31:53] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10MatthewVernon) Yeah, this is my concern, too - we used to spawn extra requests to copy new thumbnails to the other DC and that ca...
[09:33:07] <icinga-wm>	 PROBLEM - SSH on wdqs1013 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:33:37] <wikibugs>	 (03CR) 10Alexandros Kosiaris: admin: Add wikifunctions apparmor profiles to PSP (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/940371 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris)
[09:34:09] <icinga-wm>	 RECOVERY - SSH on wdqs1013 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:37:21] <icinga-wm>	 PROBLEM - SSH on wdqs1013 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:39:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:41:51] <icinga-wm>	 RECOVERY - SSH on wdqs1013 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:44:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:45:41] <icinga-wm>	 PROBLEM - SSH on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:46:23] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs1013 is CRITICAL: connect to address 127.0.0.1 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[09:46:39] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1013 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:46:39] <wikibugs>	 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10Jelto) We restarted `idp1002` and `idp-test1002`. It seems the running configuration was not the one configured, because tomcat is...
[09:46:47] <icinga-wm>	 RECOVERY - SSH on wdqs1013 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:46:47] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::kafka::broker: fix settings passed to the confluent class [puppet] - 10https://gerrit.wikimedia.org/r/941362 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey)
[09:46:58] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on kafka-main1001.eqiad.wmnet with reason: Apply a new setting to the Kafka broker
[09:47:11] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on kafka-main1001.eqiad.wmnet with reason: Apply a new setting to the Kafka broker
[09:47:55] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:48:23] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[09:48:42] <jinxer-wm>	 (SystemdUnitFailed) firing: nginx.service Failed on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:48:57] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[09:49:07] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] api-gateway: change liftwing hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/940945 (https://phabricator.wikimedia.org/T342266) (owner: 10Ilias Sarantopoulos)
[09:50:55] <elukey>	 !log restart kafka on kafka-main1001 to pick up the new changes - T341558
[09:50:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:59] <stashbot>	 T341558: Rebalance kafka partitions in main-{eqiad,codfw} clusters - 2023 edition - https://phabricator.wikimedia.org/T341558
[09:52:46] <wikibugs>	 (03PS3) 10Filippo Giunchedi: mediawiki: remove PHP7 icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/841887 (https://phabricator.wikimedia.org/T314118)
[09:53:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: nginx.service Failed on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:54:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[09:55:07] <dcausse>	 ^ expected
[09:56:59] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] bacula: Increase the number of max volumes for production pool [puppet] - 10https://gerrit.wikimedia.org/r/941368 (owner: 10Jcrespo)
[09:59:13] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Ladsgroup) It might sound a bit stupid: Why not just gradually, slowly, roll delete all thumbnails, if it's needed, it'll be rege...
[09:59:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[10:00:04] <jouncebot>	 akosiaris: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki-related infrastructural changes deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T1000).
[10:00:05] <wikibugs>	 (03PS14) 10Alexandros Kosiaris: Kubernetes: add support for deployment apparmor profiles [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm)
[10:00:12] <wikibugs>	 10SRE, 10serviceops-radar, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10fgiunchedi)
[10:00:44] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: ml-services: revscoring template change .wiki to reflect wikiID (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) (owner: 10Ilias Sarantopoulos)
[10:01:19] <wikibugs>	 (03CR) 10Alexandros Kosiaris: Kubernetes: add support for deployment apparmor profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm)
[10:01:36] <wikibugs>	 (03CR) 10Alexandros Kosiaris: Kubernetes: add support for deployment apparmor profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm)
[10:01:53] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] mediawiki: remove PHP7 icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/841887 (https://phabricator.wikimedia.org/T314118) (owner: 10Filippo Giunchedi)
[10:03:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:04:27] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "Cool! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm)
[10:06:59] <akosiaris>	 I haven't yet started the wikidiff2 deploy, doing some unexpected prepwork
[10:11:16] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=parsoid&var-site=eqiad&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:11:34] <marostegui>	 woot
[10:11:36] <Emperor>	 Ugh
[10:12:13] <elukey>	 latency went up and now it is trending down
[10:12:35] <Emperor>	 this is the thing we got p.aged about on Saturday too I think
[10:12:45] <marostegui>	 yeah
[10:13:00] <Emperor>	 think I'll reopen T342085
[10:13:01] <stashbot>	 T342085: Increase to >3s for parsoid average get/200 latency since 2023-7-15 12:30 - https://phabricator.wikimedia.org/T342085
[10:15:01] <Emperor>	 yeah, looking at https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?from=1689984000000&orgId=1&to=1690329599000&var-cluster=parsoid&var-datasource=eqiad+prometheus%2Fops&var-method=GET the latency is up again
[10:16:07] <claime>	 Emperor: There's a big increase in timeouts, and they seem to be mostly coming from two userpages
[10:16:16] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki parsoid at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=parsoid&var-site=eqiad&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[10:16:20] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users (no kerberos, no ssh) for karapayneWMDE - https://phabricator.wikimedia.org/T342546 (10BTullis) Hello. I'm listed as one of the approvers for this group, but there are a couple of things that I would like to check first, before proc...
[10:17:45] <wikibugs>	 (03PS1) 10Amire80: Remove ak from wgImportSources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941372 (https://phabricator.wikimedia.org/T333765)
[10:18:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:19:36] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: requestctl: also escape the url_path parameter [software/conftool] - 10https://gerrit.wikimedia.org/r/941373
[10:21:47] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] admin: Add wikifunctions apparmor profiles to PSP (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/940371 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris)
[10:23:11] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] admin: Add wikifunctions apparmor profiles to PSP (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/940371 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris)
[10:26:28] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Bit late to the party, but this is fine by me." [puppet] - 10https://gerrit.wikimedia.org/r/941362 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey)
[10:27:21] <wikibugs>	 (03PS1) 10Hnowlan: WIP helmfile: add namespace and service definition for geo-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/941374 (https://phabricator.wikimedia.org/T336400)
[10:27:23] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Exclude nagios checks of tmpfs mounts on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/941014 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis)
[10:27:36] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Install the ceph-volume and hdparm packages on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/941010 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis)
[10:28:16] <wikibugs>	 (03PS2) 10Btullis: Exclude nagios checks of tmpfs mounts on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/941014 (https://phabricator.wikimedia.org/T330151)
[10:28:44] <wikibugs>	 (03PS2) 10Hnowlan: WIP helmfile: add namespace and service definition for geo-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/941374 (https://phabricator.wikimedia.org/T336400)
[10:29:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:37:01] <wikibugs>	 (03PS1) 10Ladsgroup: Add make_el_to_nullable_T342617.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/941375 (https://phabricator.wikimedia.org/T342617)
[10:40:55] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "whitespace nits, but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/940323 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris)
[10:41:15] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "Comment on the modules/base/values.yaml question, but I think it's fine either way" [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris)
[10:44:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:45:57] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki: differentiate parsoid alerts [alerts] - 10https://gerrit.wikimedia.org/r/941378
[10:47:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:48:55] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] modules: Add a new networkpolicy for base modules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris)
[10:49:29] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:50:51] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.298 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:50:53] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 31 days, 0:00:00 on lvs[1013-1015].eqiad.wmnet with reason: test hosts
[10:51:08] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 31 days, 0:00:00 on lvs[1013-1015].eqiad.wmnet with reason: test hosts
[10:52:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:57:36] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] cache: set api.wikimedia.org to normal caching [puppet] - 10https://gerrit.wikimedia.org/r/937061 (https://phabricator.wikimedia.org/T338916) (owner: 10Hnowlan)
[10:59:29] <wikibugs>	 (03PS7) 10Ilias Sarantopoulos: ml-services: revscoring template change .wiki to reflect wikiID [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266)
[10:59:57] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] wmnet: Add cnames for wikifunctions ingress [dns] - 10https://gerrit.wikimedia.org/r/941312 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm)
[11:00:58] <wikibugs>	 (03PS2) 10Hnowlan: blubber: Bump blubber version to v0.17.0 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/906575 (https://phabricator.wikimedia.org/T334205) (owner: 10Atieno)
[11:02:57] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] Add make_el_to_nullable_T342617.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/941375 (https://phabricator.wikimedia.org/T342617) (owner: 10Ladsgroup)
[11:03:34] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mediawiki: differentiate parsoid alerts [alerts] - 10https://gerrit.wikimedia.org/r/941378
[11:03:58] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] mediawiki: differentiate parsoid alerts [alerts] - 10https://gerrit.wikimedia.org/r/941378 (owner: 10Giuseppe Lavagetto)
[11:05:21] <wikibugs>	 (03PS4) 10JMeybohm: CI: Generate deployment fixtures from actual hiera data [deployment-charts] - 10https://gerrit.wikimedia.org/r/939315 (https://phabricator.wikimedia.org/T300033)
[11:05:58] <wikibugs>	 (03PS3) 10Jforrester: admin: Add wikifunctions apparmor profiles to PSP [deployment-charts] - 10https://gerrit.wikimedia.org/r/940371 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris)
[11:06:05] <wikibugs>	 (03PS4) 10Jforrester: admin: Add wikifunctions apparmor profiles to PSP [deployment-charts] - 10https://gerrit.wikimedia.org/r/940371 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris)
[11:06:08] <wikibugs>	 (03CR) 10MVernon: "So this is splitting out the parsoid alerts to now page after 5m of <50% idle workers, and also returning the previous alerts (for <30% id" [alerts] - 10https://gerrit.wikimedia.org/r/941378 (owner: 10Giuseppe Lavagetto)
[11:06:10] <wikibugs>	 (03CR) 10Jforrester: admin: Add wikifunctions apparmor profiles to PSP (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/940371 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris)
[11:06:34] <wikibugs>	 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10aborrero)
[11:07:35] <wikibugs>	 (03PS5) 10Jforrester: wikifunctions: Add AppArmor profile usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/879282 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris)
[11:08:27] <wikibugs>	 (03PS8) 10Ilias Sarantopoulos: ml-services: revscoring template change .wiki to reflect wikiID [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266)
[11:09:13] <wikibugs>	 (03PS1) 10Btullis: Stop repeatedly disabling write cache on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/941380 (https://phabricator.wikimedia.org/T330151)
[11:10:42] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42689/console" [puppet] - 10https://gerrit.wikimedia.org/r/941380 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis)
[11:11:20] <wikibugs>	 (03CR) 10JMeybohm: CI: Generate deployment fixtures from actual hiera data (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/939315 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[11:12:06] <wikibugs>	 (03CR) 10Klausman: ml-services: revscoring template change .wiki to reflect wikiID (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) (owner: 10Ilias Sarantopoulos)
[11:12:16] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] mediawiki: differentiate parsoid alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/941378 (owner: 10Giuseppe Lavagetto)
[11:12:45] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] mediawiki: differentiate parsoid alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/941378 (owner: 10Giuseppe Lavagetto)
[11:14:31] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Stop repeatedly disabling write cache on cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/941380 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis)
[11:15:13] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] admin: Add wikifunctions apparmor profiles to PSP [deployment-charts] - 10https://gerrit.wikimedia.org/r/940371 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris)
[11:21:54] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudservices1006: prepare service [puppet] - 10https://gerrit.wikimedia.org/r/941383 (https://phabricator.wikimedia.org/T342161)
[11:22:48] <TheresNoTime>	 akosiaris: hi, just checking in, did wikidiff2 get deployed or are you still on the unexpected work? :-)
[11:23:06] <akosiaris>	 TheresNoTime: starting right now
[11:23:14] <TheresNoTime>	 ack :)
[11:24:17] <akosiaris>	 !log T340087 starting wikidiff2 1.14.1 rollout to codfw
[11:24:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:21] <stashbot>	 T340087: Deploy wikidiff2 1.14.1 - https://phabricator.wikimedia.org/T340087
[11:24:29] <James_F>	 Ooh, fancy.
[11:24:54] <James_F>	 … how do we get downstreams like Debian to pick up new releases of wikidiff2?
[11:25:08] <akosiaris>	 !log T340087 keep a copy of php-wikidiff2_1.13.0-1_amd64.deb in apt1001:/home/akosiaris/wd/ in case of emergency
[11:25:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:25] <akosiaris>	 James_F: I assume we ping legok.tm  :-)
[11:25:54] <James_F>	 OK, fair, but what about randoms like 1&1 MW hosting etc. :-)
[11:26:03] <James_F>	 Do we just hope they notice?
[11:26:03] <akosiaris>	 Maintainer: MediaWiki packaging team <mediawiki-debian@lists.wikimedia.org>
[11:26:13] <icinga-wm>	 PROBLEM - PHP opcache health on mw1457 is CRITICAL: CRITICAL: opcache full on php 7.4. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[11:26:31] <akosiaris>	 hmm this one ^ has nothing to do with my change. /me looking
[11:26:35] <James_F>	 I don't think I've consciously seen us ever send a note to mediawiki-announce or whatever.
[11:27:14] <akosiaris>	 James_F: I don't think so either.
[11:27:15] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloudservices1006: prepare service [puppet] - 10https://gerrit.wikimedia.org/r/941383 (https://phabricator.wikimedia.org/T342161)
[11:27:34] <James_F>	 Something for the new MW group to think about.
[11:27:55] <taavi>	 James_F: https://tracker.debian.org/news/1443137/accepted-wikidiff2-1141-1-source-into-unstable/
[11:28:03] <akosiaris>	 !log restart php on mw1457
[11:28:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:14] <James_F>	 taavi: Of course you're already on this. :-)
[11:28:16] * akosiaris waiting for this to clear out and then proceeding with wikidiff2 in eqiad
[11:29:15] <icinga-wm>	 RECOVERY - PHP opcache health on mw1457 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[11:29:58] <akosiaris>	 !log T340087 starting wikidiff2 1.14.1 rollout to eqiad. codfw already done.
[11:30:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:02] <stashbot>	 T340087: Deploy wikidiff2 1.14.1 - https://phabricator.wikimedia.org/T340087
[11:32:17] <akosiaris>	 !log T340087 wikidiff2 rollout done. 1 host is unreachable and will need to be reimaged or upgraded manually to pick this up, parse1002.eqiad.wmnet
[11:32:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:24] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] blubber: Bump blubber version to v0.17.0 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/906575 (https://phabricator.wikimedia.org/T334205) (owner: 10Atieno)
[11:32:26] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] mediawiki: differentiate parsoid alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/941378 (owner: 10Giuseppe Lavagetto)
[11:32:28] <akosiaris>	 TheresNoTime: And we are done. 
[11:32:37] <TheresNoTime>	 woo, thank you :)
[11:32:52] <akosiaris>	 yw
[11:33:58] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.dns.netbox
[11:34:55] <icinga-wm>	 ACKNOWLEDGEMENT - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 131 jobs Jcrespo backups now catching up after storage issue solved - The acknowledgement expires at: 2023-07-26 11:34:11. https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[11:35:14] <jynus>	 ^ Emperor, marostegui
[11:35:25] <wikibugs>	 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero)
[11:35:33] <marostegui>	 thanks
[11:35:37] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/940152 (https://phabricator.wikimedia.org/T326785) (owner: 10JMeybohm)
[11:36:29] <wikibugs>	 (03Merged) 10jenkins-bot: blubber: Bump blubber version to v0.17.0 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/906575 (https://phabricator.wikimedia.org/T334205) (owner: 10Atieno)
[11:36:40] <wikibugs>	 (03Abandoned) 10Jelto: idp: remove nda from required_groups for gitlab_replica_oidc [puppet] - 10https://gerrit.wikimedia.org/r/941319 (https://phabricator.wikimedia.org/T320390) (owner: 10Jelto)
[11:36:53] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: openstack - aborrero@cumin1001"
[11:37:39] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: openstack - aborrero@cumin1001"
[11:37:39] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:39:12] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Add wikifunctions apparmor profiles to PSP [deployment-charts] - 10https://gerrit.wikimedia.org/r/940371 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris)
[11:40:06] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Add wikifunctions apparmor profiles to PSP (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/940371 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris)
[11:40:47] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: cloudservices1006: prepare service [puppet] - 10https://gerrit.wikimedia.org/r/941383 (https://phabricator.wikimedia.org/T342161)
[11:41:42] <wikibugs>	 (03Merged) 10jenkins-bot: admin: Add wikifunctions apparmor profiles to PSP [deployment-charts] - 10https://gerrit.wikimedia.org/r/940371 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris)
[11:45:41] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[11:45:56] <wikibugs>	 10SRE, 10CAS-SSO, 10Infrastructure-Foundations, 10collaboration-services, and 4 others: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (10SLyngshede-WMF) We previously suspected that the issue was that CAS nested the attributes it returns via the profile, but gave up...
[11:46:12] <wikibugs>	 (03CR) 10Jon Harald Søby: [C: 04-1] [DNM] Initial configuration for Wikifunctions.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester)
[11:46:34] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic: NetworkProbeLimit cookie should set samesite attribute - https://phabricator.wikimedia.org/T342624 (10Reedy)
[11:46:57] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[11:47:26] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Add make_el_to_nullable_T342617.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/941375 (https://phabricator.wikimedia.org/T342617) (owner: 10Ladsgroup)
[11:47:35] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[11:47:51] <wikibugs>	 (03Merged) 10jenkins-bot: Add make_el_to_nullable_T342617.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/941375 (https://phabricator.wikimedia.org/T342617) (owner: 10Ladsgroup)
[11:48:05] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[11:48:15] <wikibugs>	 (03PS15) 10Slyngshede: C:bigtop::hadoop move net-topology.py to files. [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480)
[11:48:48] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[11:48:59] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[11:49:41] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[11:49:50] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[11:54:07] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] wikifunctions: Add AppArmor profile usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/879282 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris)
[11:54:59] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Add AppArmor profile usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/879282 (https://phabricator.wikimedia.org/T326785) (owner: 10Alexandros Kosiaris)
[12:02:25] <wikibugs>	 (03PS1) 10Slyngshede: D:apereo_cas::service support FLAT profiles. [puppet] - 10https://gerrit.wikimedia.org/r/941391 (https://phabricator.wikimedia.org/T320390)
[12:05:30] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42690/console" [puppet] - 10https://gerrit.wikimedia.org/r/941391 (https://phabricator.wikimedia.org/T320390) (owner: 10Slyngshede)
[12:06:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance
[12:06:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance
[12:06:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2114 (T342617)', diff saved to https://phabricator.wikimedia.org/P49699 and previous config saved to /var/cache/conftool/dbconfig/20230725-120641-ladsgroup.json
[12:06:45] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[12:08:30] <wikibugs>	 (03PS2) 10Slyngshede: D:apereo_cas::service support FLAT profiles. [puppet] - 10https://gerrit.wikimedia.org/r/941391 (https://phabricator.wikimedia.org/T320390)
[12:08:43] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm, this should only affect idp-test and gitlab-replica. We should keep in mind that we set NESTED for all other OIDC clients as well wi" [puppet] - 10https://gerrit.wikimedia.org/r/941391 (https://phabricator.wikimedia.org/T320390) (owner: 10Slyngshede)
[12:11:58] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42691/console" [puppet] - 10https://gerrit.wikimedia.org/r/941391 (https://phabricator.wikimedia.org/T320390) (owner: 10Slyngshede)
[12:12:12] <wikibugs>	 (03PS7) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843)
[12:12:14] <wikibugs>	 (03PS9) 10Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117)
[12:12:16] <wikibugs>	 (03PS9) 10Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117)
[12:12:25] <wikibugs>	 (03CR) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris)
[12:16:12] <wikibugs>	 (03CR) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris)
[12:18:14] <wikibugs>	 (03PS8) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843)
[12:18:16] <wikibugs>	 (03PS10) 10Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117)
[12:18:18] <wikibugs>	 (03PS10) 10Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117)
[12:18:26] <wikibugs>	 (03CR) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris)
[12:24:11] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: deployment: Support making k8s deploys db section aware [puppet] - 10https://gerrit.wikimedia.org/r/940323 (https://phabricator.wikimedia.org/T340843)
[12:24:28] <wikibugs>	 (03CR) 10Alexandros Kosiaris: deployment: Support making k8s deploys db section aware (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/940323 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris)
[12:27:24] <wikibugs>	 (03PS1) 10Elukey: role::kafka::main: apply new threads settings to all brokers [puppet] - 10https://gerrit.wikimedia.org/r/941396 (https://phabricator.wikimedia.org/T341558)
[12:30:48] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42692/console" [puppet] - 10https://gerrit.wikimedia.org/r/941396 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey)
[12:34:39] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:36:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T342617)', diff saved to https://phabricator.wikimedia.org/P49700 and previous config saved to /var/cache/conftool/dbconfig/20230725-123602-ladsgroup.json
[12:36:08] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[12:41:33] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] mediawiki: remove PHP7 icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/841887 (https://phabricator.wikimedia.org/T314118) (owner: 10Filippo Giunchedi)
[12:43:58] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10Clement_Goubert) Note for #serviceops later: once fixed, the host will need to be updated to pick up {T340087}
[12:45:13] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:45:29] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] role::kafka::main: apply new threads settings to all brokers [puppet] - 10https://gerrit.wikimedia.org/r/941396 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey)
[12:49:12] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] deployment: Support making k8s deploys db section aware [puppet] - 10https://gerrit.wikimedia.org/r/940323 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris)
[12:49:15] <wikibugs>	 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T342592 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact
[12:50:58] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] role::kafka::main: apply new threads settings to all brokers [puppet] - 10https://gerrit.wikimedia.org/r/941396 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey)
[12:51:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P49701 and previous config saved to /var/cache/conftool/dbconfig/20230725-125109-ladsgroup.json
[12:55:25] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] role::kafka::main: apply new threads settings to all brokers [puppet] - 10https://gerrit.wikimedia.org/r/941396 (https://phabricator.wikimedia.org/T341558) (owner: 10Elukey)
[12:58:54] <wikibugs>	 (03PS1) 10Jelto: aptrepo: update gitlab-ce & gitlab-runner to 16.0 [puppet] - 10https://gerrit.wikimedia.org/r/941398 (https://phabricator.wikimedia.org/T338460)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: Dear deployers, time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T1300).
[13:00:05] <jouncebot>	 Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T1300)
[13:00:13] <Dreamy_Jazz>	 \o
[13:00:24] <taavi>	 I can deploy in a few moments, just wrapping up another thing
[13:00:38] <wikibugs>	 10ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T342565 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated power cord, alert did not clear. reseated PSU1, alert cleared.
[13:00:49] <taavi>	 (if someone else is around, feel free to go ahead)
[13:00:59] <wikibugs>	 (03PS9) 10Ilias Sarantopoulos: ml-services: revscoring template change .wiki to reflect wikiID [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266)
[13:01:01] <urbanecm>	 i can deploy today
[13:01:08] <urbanecm>	 hi Dreamy_Jazz 
[13:01:12] <Dreamy_Jazz>	 Hello.
[13:01:20] <taavi>	 thx urbanecm
[13:01:31] <Dreamy_Jazz>	 I'm also coordinating with Ladsgroup to check that the tables are not being replicated to cloud DBs
[13:01:41] <urbanecm>	 great
[13:02:00] <urbanecm>	 Dreamy_Jazz: do we have all relevant code in wmf.19 (maybe even .18) already?
[13:02:09] <Dreamy_Jazz>	 As far as I am aware, yes
[13:02:14] <Dreamy_Jazz>	 But let me double check
[13:03:09] <Dreamy_Jazz>	 wmf.19 has all the relevant changes
[13:03:29] <Dreamy_Jazz>	 Including the moving of the default value to write new (as well as write and read old)
[13:03:32] <urbanecm>	 but not .18, afaik. 
[13:03:37] <Dreamy_Jazz>	 Yes
[13:03:42] <Dreamy_Jazz>	 Not wmf.18
[13:03:44] <urbanecm>	 which means a train rollback can cause the code to be missing
[13:03:50] <urbanecm>	 what would happen in that scenario?
[13:04:33] * Lucas_WMDE also around now but probably not needed :)
[13:04:36] <Dreamy_Jazz>	 I'm not sure
[13:04:52] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: differentiate parsoid alerts [alerts] - 10https://gerrit.wikimedia.org/r/941378 (owner: 10Giuseppe Lavagetto)
[13:04:55] <Dreamy_Jazz>	 Though what I think would happen is that the code would only write old
[13:05:08] <Dreamy_Jazz>	 But let me check that
[13:05:18] <urbanecm>	 okay
[13:05:29] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 132 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[13:05:57] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: differentiate parsoid alerts [alerts] - 10https://gerrit.wikimedia.org/r/941378 (owner: 10Giuseppe Lavagetto)
[13:06:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P49702 and previous config saved to /var/cache/conftool/dbconfig/20230725-130615-ladsgroup.json
[13:06:40] <wikibugs>	 10SRE, 10ops-codfw: Decommission asw-b1-codfw - https://phabricator.wikimedia.org/T342076 (10Papaul) 05Open→03Resolved a:03Papaul This is complete
[13:06:54] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: ml-services: revscoring template change .wiki to reflect wikiID (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) (owner: 10Ilias Sarantopoulos)
[13:07:07] <Dreamy_Jazz>	 Unfortunately, it looks like at least some of the code would end up only writing new
[13:07:33] <icinga-wm>	 PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga
[13:07:37] <Dreamy_Jazz>	 So if you think the possibility of a train rollback on testwiki is too high, then perhaps we should wait until next week?
[13:09:52] <Dreamy_Jazz>	 urbanecm: ^?
[13:09:54] <urbanecm>	 if train rollback means write new behavior in some branches, i think that's a significant issue, as it'll give us inconsistent data. i'm afraid fixing it could get difficult, esp. if the `cuc_only_for_read_old` gets inconsistent as well
[13:10:10] <wikibugs>	 10SRE, 10ops-codfw: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10Papaul) 05Open→03Resolved The test server has been returned so we are good to close this task.
[13:10:10] <urbanecm>	 i think waiting for next week is a good idea. alternatively, we can backport stuff.
[13:10:21] <godog>	 there will be icinga config failure alerts for alert hosts, that is me
[13:10:31] <urbanecm>	 but afaik it's quite a lot of patches to backport through?
[13:10:58] <wikibugs>	 (03CR) 10Jon Harald Søby: [DNM] Initial configuration for Wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester)
[13:11:19] <Dreamy_Jazz>	 I think it would only be one patch that would need backporting
[13:11:33] <Dreamy_Jazz>	 Actually would be two because of the follow-up
[13:12:04] <Dreamy_Jazz>	 It would be https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/CheckUser/+/59888cddf495de7ea2b4ec6ff563f9543713281b and https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/CheckUser/+/213281cf65997cd52701845208307b39e06e9163
[13:12:30] <urbanecm>	 gotcha. then, up2you. happy to backport those two if you want to do this earlier rather than later.
[13:13:05] <Dreamy_Jazz>	 It would be good to test both replication to cloud DBs and have a good amount of testing time on testwiki
[13:13:17] <urbanecm>	 let's backport then :)
[13:13:21] <Dreamy_Jazz>	 Thanks :)
[13:13:34] <wikibugs>	 (03PS1) 10Urbanecm: Add support for writing both new and old to Hooks.php [extensions/CheckUser] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/941414 (https://phabricator.wikimedia.org/T341934)
[13:14:16] <Dreamy_Jazz>	 The other patches that were needed to change the default modify code that wouldn't be run unless update.php is run and/or those maintenance scripts are run. This won't happen unless someone was to run them manually.
[13:14:20] <wikibugs>	 (03PS1) 10Urbanecm: Follow-up: Add support for writing both new and old to Hooks.php [extensions/CheckUser] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/941400 (https://phabricator.wikimedia.org/T341586)
[13:14:23] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Add support for writing both new and old to Hooks.php [extensions/CheckUser] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/941414 (https://phabricator.wikimedia.org/T341934) (owner: 10Urbanecm)
[13:14:30] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Follow-up: Add support for writing both new and old to Hooks.php [extensions/CheckUser] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/941400 (https://phabricator.wikimedia.org/T341586) (owner: 10Urbanecm)
[13:15:47] <urbanecm>	 gotcha
[13:16:28] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/CheckUser] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/941414 (https://phabricator.wikimedia.org/T341934) (owner: 10Urbanecm)
[13:16:30] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/CheckUser] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/941400 (https://phabricator.wikimedia.org/T341586) (owner: 10Urbanecm)
[13:17:21] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-main-codfw cluster: Roll restart of jvm daemons.
[13:18:29] <urbanecm>	 syncing w/o testing, as it can't really be tested with write new, and it's a no-op otherwise
[13:19:10] <Dreamy_Jazz>	 Plus that change is on group0 wikis as it's in wmf.19 (so it shouldn't be an issue).
[13:19:32] <urbanecm>	 yup
[13:19:49] <icinga-wm>	 RECOVERY - Host parse1002 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[13:20:03] <godog>	 !log powercycle parse1002 - T339340
[13:20:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:06] <stashbot>	 T339340: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340
[13:20:09] <icinga-wm>	 PROBLEM - Check systemd state on parse1002 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:21:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T342617)', diff saved to https://phabricator.wikimedia.org/P49704 and previous config saved to /var/cache/conftool/dbconfig/20230725-132121-ladsgroup.json
[13:21:26] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[13:21:39] <icinga-wm>	 RECOVERY - Check systemd state on parse1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:23:05] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340 (10fgiunchedi) I rebooted the host because I needed a puppet run on it, I'll leave it alone now!
[13:24:15] <icinga-wm>	 RECOVERY - Check systemd state on mw1424 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:24:49] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+1] aptrepo: update gitlab-ce & gitlab-runner to 16.0 [puppet] - 10https://gerrit.wikimedia.org/r/941398 (https://phabricator.wikimedia.org/T338460) (owner: 10Jelto)
[13:26:07] <wikibugs>	 (03PS1) 10Majavah: Add perl536-sssd [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941401 (https://phabricator.wikimedia.org/T335507)
[13:26:27] <wikibugs>	 10SRE, 10serviceops-radar, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10fgiunchedi)
[13:28:06] <icinga-wm>	 RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga
[13:29:51] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] requestctl: also escape the url_path parameter [software/conftool] - 10https://gerrit.wikimedia.org/r/941373 (owner: 10Giuseppe Lavagetto)
[13:29:56] <wikibugs>	 (03Merged) 10jenkins-bot: Add support for writing both new and old to Hooks.php [extensions/CheckUser] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/941414 (https://phabricator.wikimedia.org/T341934) (owner: 10Urbanecm)
[13:30:07] <wikibugs>	 (03Merged) 10jenkins-bot: Follow-up: Add support for writing both new and old to Hooks.php [extensions/CheckUser] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/941400 (https://phabricator.wikimedia.org/T341586) (owner: 10Urbanecm)
[13:30:13] <urbanecm>	 there we go
[13:30:20] <Dreamy_Jazz>	 Great.
[13:30:22] <Dreamy_Jazz>	 Still around.
[13:30:52] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:941414|Add support for writing both new and old to Hooks.php (T341934 T341586)]], [[gerrit:941400|Follow-up: Add support for writing both new and old to Hooks.php (T341586)]]
[13:30:59] <stashbot>	 T341586: Allow write old and new for event table migration - https://phabricator.wikimedia.org/T341586
[13:30:59] <stashbot>	 T341934: Failing tests for CheckUser when event table migration config set to WRITE_BOTH and READ_NEW - https://phabricator.wikimedia.org/T341934
[13:31:10] <urbanecm>	 it'll go w/o the mwdebug stop though, as i mentioned.
[13:32:56] <wikibugs>	 (03Merged) 10jenkins-bot: requestctl: also escape the url_path parameter [software/conftool] - 10https://gerrit.wikimedia.org/r/941373 (owner: 10Giuseppe Lavagetto)
[13:38:21] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:941414|Add support for writing both new and old to Hooks.php (T341934 T341586)]], [[gerrit:941400|Follow-up: Add support for writing both new and old to Hooks.php (T341586)]] (duration: 07m 28s)
[13:38:26] <stashbot>	 T341586: Allow write old and new for event table migration - https://phabricator.wikimedia.org/T341586
[13:38:26] <stashbot>	 T341934: Failing tests for CheckUser when event table migration config set to WRITE_BOTH and READ_NEW - https://phabricator.wikimedia.org/T341934
[13:38:28] <wikibugs>	 (03PS4) 10Urbanecm: Enable write new on testwiki for CheckUser event tables migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940927 (https://phabricator.wikimedia.org/T330158) (owner: 10Dreamy Jazz)
[13:38:31] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Enable write new on testwiki for CheckUser event tables migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940927 (https://phabricator.wikimedia.org/T330158) (owner: 10Dreamy Jazz)
[13:38:38] <urbanecm>	 so, backport done
[13:38:43] <urbanecm>	 let's move on to the config change now
[13:38:45] <Dreamy_Jazz>	 Nice. Thanks!
[13:38:52] <logmsgbot>	 !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cgoubert@cumin1001"
[13:38:53] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1486.eqiad.wmnet with OS buster
[13:39:12] <wikibugs>	 (03Merged) 10jenkins-bot: Enable write new on testwiki for CheckUser event tables migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940927 (https://phabricator.wikimedia.org/T330158) (owner: 10Dreamy Jazz)
[13:40:12] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:940927|Enable write new on testwiki for CheckUser event tables migration (T330158)]]
[13:40:13] <wikibugs>	 (03PS9) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843)
[13:40:15] <stashbot>	 T330158: Enable write new for the event table migration - https://phabricator.wikimedia.org/T330158
[13:40:22] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "re-run to fix mw1486 - cgoubert@cumin1001"
[13:41:07] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "re-run to fix mw1486 - cgoubert@cumin1001"
[13:41:14] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Add mw-misc service under ingress [dns] - 10https://gerrit.wikimedia.org/r/941403 (https://phabricator.wikimedia.org/T341859)
[13:41:49] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm and dreamyjazz: Backport for [[gerrit:940927|Enable write new on testwiki for CheckUser event tables migration (T330158)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:41:58] <urbanecm>	 Dreamy_Jazz: available at mwdebug now :). can you test?
[13:42:05] <Dreamy_Jazz>	 Sure
[13:42:09] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 15 days, 0:00:00 on parse1002.eqiad.wmnet with reason: T339340 - hw troubleshooting
[13:42:12] <stashbot>	 T339340: hw troubleshooting: CPU machine check failure for parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T339340
[13:42:22] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 15 days, 0:00:00 on parse1002.eqiad.wmnet with reason: T339340 - hw troubleshooting
[13:43:06] <Dreamy_Jazz>	 Will events appear in logstash for debug servers?
[13:43:12] <Dreamy_Jazz>	 Just want to look for any exceptions
[13:43:48] <claime>	 Dreamy_Jazz: https://logstash.wikimedia.org/app/dashboards#/view/mwdebug1002?_g=h@42b0d52&_a=h@7f0701a
[13:43:53] <urbanecm>	 Dreamy_Jazz: yes, they will, on the mwdebug server dashboard too
[13:44:01] <urbanecm>	 which claime helpfully linked, thank you.
[13:44:07] <Dreamy_Jazz>	 Thanks both
[13:44:09] <Dreamy_Jazz>	 Still testing
[13:44:39] <urbanecm>	 you can also enable Verbose log, which will direct all logs (even debug logs) to logstash
[13:45:34] <wikibugs>	 10SRE-swift-storage, 10Data-Persistence, 10Discovery-Search: Storage request: swift s3 bucket for flink search-update-pipeline checkpointing - https://phabricator.wikimedia.org/T342620 (10bking)
[13:46:59] <Dreamy_Jazz>	 Nearly done. Will need someone to inspect DB shortly as logstash doesn't show the insert queries
[13:47:11] <urbanecm>	 sure
[13:48:09] <urbanecm>	 (fwiw, you would see insert queries with verbose logging, but happy to inspect manually)
[13:48:16] <Dreamy_Jazz>	 Hmm.
[13:48:23] <Dreamy_Jazz>	 I didn't find them on that page
[13:49:29] <Dreamy_Jazz>	 Okay. My part in the testing is done.
[13:49:59] <Dreamy_Jazz>	 If you could inspect the DB and see if "cu_log_event" and "cu_private_event" have entries
[13:50:01] <urbanecm>	 https://logstash.wikimedia.org/goto/be7902d11198d60c31c068e7a3ee25ce shows a bunch of inserts
[13:50:02] <urbanecm>	 sure
[13:50:27] <urbanecm>	 i see two entries in each table
[13:50:27] <Dreamy_Jazz>	 Thanks for that link.
[13:51:04] <Dreamy_Jazz>	 Yup.
[13:51:16] <Dreamy_Jazz>	 That is the expected state
[13:51:18] <urbanecm>	 great
[13:51:25] <urbanecm>	 should we check Special:CheckUser works as expected too?
[13:51:27] <Dreamy_Jazz>	 (I moved twice and logged out and then back in)
[13:51:37] <Dreamy_Jazz>	 Sure.
[13:51:53] <urbanecm>	 yup, matches what i see in the tables
[13:52:07] <Dreamy_Jazz>	 The change in this config should not have affected anything in Special:CheckUser, but happy to test it anyway
[13:52:49] <urbanecm>	 just to be on the safe end
[13:53:08] <Dreamy_Jazz>	 For sure.
[13:53:24] <Dreamy_Jazz>	 Ready to test if you would like me to do so (will need CU rights on testwiki)
[13:53:36] <urbanecm>	 granted
[13:53:39] <urbanecm>	 please go ahead
[13:54:55] <Dreamy_Jazz>	 Special:CheckUser works as normal (no change). Will try Investigate
[13:55:38] <Dreamy_Jazz>	 I have hit an exception in Investigate, but it seems unrelated to this config change
[13:56:05] <Dreamy_Jazz>	 "Wikimedia\Assert\PreconditionException: Expected MediaWiki\User\UserIdentityValue to belong to 'afwiki', but it belongs to the local wiki"
[13:56:19] <Dreamy_Jazz>	 I will check if that happens on non-debug servers
[13:56:33] <Dreamy_Jazz>	 Yup. Same error on non-debug servers
[13:56:36] <Dreamy_Jazz>	 So should be unrelated
[13:56:39] <urbanecm>	 okay, so unrelated
[13:56:45] <urbanecm>	 so, let's proceed then?
[13:56:49] <Dreamy_Jazz>	 Yes.
[13:56:57] <urbanecm>	 deploying
[13:57:02] <Dreamy_Jazz>	 I will file a bug report for the issue that I found in Investigate.
[13:57:05] <wikibugs>	 (03PS1) 10Hnowlan: trafficserver: add gateway routing script, route device-analytics [puppet] - 10https://gerrit.wikimedia.org/r/941405 (https://phabricator.wikimedia.org/T320967)
[13:57:06] <urbanecm>	 ty
[13:57:14] <Dreamy_Jazz>	 Thanks for your help on this!
[13:57:19] <urbanecm>	 any time
[14:00:21] <sukhe>	 !log rolling out pdns-recursor update on A:dns-rec
[14:00:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:39] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:940927|Enable write new on testwiki for CheckUser event tables migration (T330158)]] (duration: 22m 27s)
[14:02:42] <stashbot>	 T330158: Enable write new for the event table migration - https://phabricator.wikimedia.org/T330158
[14:02:43] <urbanecm>	 and live
[14:02:47] <Dreamy_Jazz>	 Yay!
[14:02:49] <urbanecm>	 Dreamy_Jazz: anything else i can help with today?
[14:02:49] <Dreamy_Jazz>	 Thanks again
[14:02:58] <Dreamy_Jazz>	 No. That should be everything.
[14:03:26] <urbanecm>	 okay, great!
[14:03:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:06:33] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:07:22] <Dreamy_Jazz>	 Issue found with Investigate reported as https://phabricator.wikimedia.org/T342655. Should be fine with early removal of CU rights if needed (noticed it expires in around 50 mins)
[14:08:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:10:04] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "LGTM, proceed with the usual caution with this one" [puppet] - 10https://gerrit.wikimedia.org/r/941405 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan)
[14:10:30] <urbanecm>	 ty
[14:11:13] <wikibugs>	 10SRE, 10ops-codfw, 10decommission-hardware: decommission wdqs200[4-6] - https://phabricator.wikimedia.org/T342600 (10Jhancock.wm) 05Open→03Resolved
[14:11:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:12:44] <wikibugs>	 10SRE, 10ops-codfw, 10decommission-hardware: decommission wdqs200[4-6] - https://phabricator.wikimedia.org/T342600 (10Jhancock.wm) DECOMed servers, ssd's removed, servers in storage, and updated in netbox.
[14:13:15] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: service::catalog: add mw-misc [puppet] - 10https://gerrit.wikimedia.org/r/941429 (https://phabricator.wikimedia.org/T341859)
[14:16:32] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:18:37] <wikibugs>	 (03PS3) 10Ssingh: varnish: handle varnish-frontend-hospital crash with a (null) line [puppet] - 10https://gerrit.wikimedia.org/r/940989 (https://phabricator.wikimedia.org/T342566)
[14:19:00] <wikibugs>	 (03CR) 10Ssingh: "Thanks for the review! Addressed the comment." [puppet] - 10https://gerrit.wikimedia.org/r/940989 (https://phabricator.wikimedia.org/T342566) (owner: 10Ssingh)
[14:19:54] <icinga-wm>	 PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:21:29] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] varnish: handle varnish-frontend-hospital crash with a (null) line [puppet] - 10https://gerrit.wikimedia.org/r/940989 (https://phabricator.wikimedia.org/T342566) (owner: 10Ssingh)
[14:24:22] <fabfur>	 !log start stopping services and rebooting lvs5006 (T335835)
[14:24:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:26:59] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs5006.eqsin.wmnet
[14:27:12] <logmsgbot>	 !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host lvs5006.eqsin.wmnet
[14:28:15] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs5006.eqsin.wmnet
[14:28:18] <logmsgbot>	 !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host lvs5006.eqsin.wmnet
[14:29:02] <hnowlan>	 !log disabling puppet on A:cp for rollout of r/941405
[14:29:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:13] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs5006.eqsin.wmnet
[14:29:15] <logmsgbot>	 !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host lvs5006.eqsin.wmnet
[14:29:39] <wikibugs>	 (03PS1) 10Zabe: Add namespace translations for Mandailing (btm) [core] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941416 (https://phabricator.wikimedia.org/T335217)
[14:29:57] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs5006.eqsin.wmnet
[14:30:00] <logmsgbot>	 !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host lvs5006.eqsin.wmnet
[14:30:03] <wikibugs>	 (03PS1) 10Zabe: Add namespace translations for Mandailing (btm) [core] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/941417 (https://phabricator.wikimedia.org/T335217)
[14:30:16] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs5006.eqsin.wmnet
[14:30:28] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] trafficserver: add gateway routing script, route device-analytics [puppet] - 10https://gerrit.wikimedia.org/r/941405 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan)
[14:30:42] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=ats-be,name=cp2037.codfw.wmnet
[14:33:16] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs5006.eqsin.wmnet
[14:34:16] <icinga-wm>	 PROBLEM - pybal on lvs5006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[14:34:24] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs5006 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[14:34:45] <vgutierrez>	 wtf?
[14:34:49] <vgutierrez>	 oh... expected :_)
[14:34:52] <sukhe>	 :D
[14:35:06] <elukey>	 see Valentin is always jumpy with pybal :D
[14:35:17] <vgutierrez>	 "start stopping"
[14:35:25] <vgutierrez>	 oxymoronic log lines by fabfur 
[14:35:33] <sukhe>	 you can wake him up from deep sleep by typing "pybal critical"
[14:35:44] <icinga-wm>	 RECOVERY - pybal on lvs5006 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[14:35:50] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs5006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:35:55] <fabfur>	 !log lvs5006 rebooted and services restarted (T335835)
[14:35:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:06] <vgutierrez>	 sukhe: wait, I'm the only one with a brain IRQ wired to my IRC client?
[14:36:17] <sukhe>	 :]
[14:36:21] <claime>	 That's called a taser
[14:36:25] <claime>	 And it's illegal in most places
[14:36:34] <claime>	 :p
[14:36:50] <vgutierrez>	 claime: anything good is illegal in at least some place
[14:37:04] <sukhe>	 bash
[14:37:05] <claime>	 x)
[14:37:21] <vgutierrez>	 sukhe: O:)
[14:37:51] <fabfur>	 there's also "stop stopping" and "start starting" but I use them only on a few occasions
[14:38:02] <sukhe>	 https://bash.toolforge.org/quip/gXd8jYkBGiVuUzOdOLax
[14:38:54] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: revscoring template change .wiki to reflect wikiID [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) (owner: 10Ilias Sarantopoulos)
[14:39:00] <icinga-wm>	 RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:40:20] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] kubernetes: add mw-misc "service" [puppet] - 10https://gerrit.wikimedia.org/r/940186 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto)
[14:40:28] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: revscoring template change .wiki to reflect wikiID [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) (owner: 10Ilias Sarantopoulos)
[14:41:07] <wikibugs>	 10SRE, 10Traffic: Upgrade to pdns-recursor 4.8.4 - https://phabricator.wikimedia.org/T341611 (10ssingh)
[14:41:33] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] Add mw-misc service under ingress [dns] - 10https://gerrit.wikimedia.org/r/941403 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto)
[14:41:34] <wikibugs>	 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10RobH)
[14:41:39] <wikibugs>	 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10RobH)
[14:41:42] <wikibugs>	 (03CR) 10Jobo: [C: 03+2] groups: Add taavi to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/940269 (https://phabricator.wikimedia.org/T342307) (owner: 10Andrea Denisse)
[14:42:08] <wikibugs>	 (03PS1) 10Elukey: profile::logstash: allow more Istio ingress gateway logs [puppet] - 10https://gerrit.wikimedia.org/r/941434
[14:42:52] <vgutierrez>	 fabfur: begin and finish are your friends <3
[14:43:04] <wikibugs>	 10SRE, 10Traffic: Upgrade to pdns-recursor 4.8.4 - https://phabricator.wikimedia.org/T341611 (10ssingh) 05Open→03Resolved ` ||/ Name           Version         Architecture Description +++-==============-===============-============-================================= ii  pdns-recursor  4.8.4-1+wmf11u1 amd64...
[14:43:21] <fabfur>	 !log begin rebooting lvs5004 (T335835)
[14:43:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:48] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] service::catalog: add mw-misc [puppet] - 10https://gerrit.wikimedia.org/r/941429 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto)
[14:44:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:45:06] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply
[14:45:39] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply
[14:45:57] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/device-analytics: apply
[14:46:06] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] varnish: handle varnish-frontend-hospital crash with a (null) line [puppet] - 10https://gerrit.wikimedia.org/r/940989 (https://phabricator.wikimedia.org/T342566) (owner: 10Ssingh)
[14:46:36] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply
[14:48:34] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs5004 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[14:49:04] <icinga-wm>	 PROBLEM - pybal on lvs5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[14:49:18] <icinga-wm>	 PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: sync-gitlab-group-with-ldap.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:49:40] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs5004 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[14:49:51] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10RobH)
[14:49:58] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Platform-SRE: Q1:rack/setup/install wdqs101[789] - https://phabricator.wikimedia.org/T342660 (10RobH)
[14:50:55] <zabe>	 jouncebot: nowandnext
[14:50:56] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 9 minute(s)
[14:50:56] <jouncebot>	 In 1 hour(s) and 9 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T1600)
[14:51:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [core] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/941417 (https://phabricator.wikimedia.org/T335217) (owner: 10Zabe)
[14:51:50] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [core] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941416 (https://phabricator.wikimedia.org/T335217) (owner: 10Zabe)
[14:52:47] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=ats-be,name=cp2037.codfw.wmnet
[14:54:32] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:54:50] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nat Hillard - https://phabricator.wikimedia.org/T342588 (10Isaac) FYI relevant past ticket on this particular set of permissions: T270438
[14:56:18] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:56:34] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:57:38] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.288 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:57:52] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50276 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:58:32] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[14:58:36] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-main-codfw cluster: Roll restart of jvm daemons.
[14:58:42] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[14:58:51] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[14:58:52] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-main-eqiad cluster: Roll restart of jvm daemons.
[14:58:59] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[14:59:08] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[14:59:11] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Remove the openjdk images based on stretch [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/939256 (https://phabricator.wikimedia.org/T341115) (owner: 10Giuseppe Lavagetto)
[14:59:18] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[14:59:26] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[14:59:26] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Remove the openjdk images based on stretch [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/939256 (https://phabricator.wikimedia.org/T341115)
[14:59:32] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2] Remove the openjdk images based on stretch [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/939256 (https://phabricator.wikimedia.org/T341115) (owner: 10Giuseppe Lavagetto)
[15:01:08] <icinga-wm>	 RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:08:01] <wikibugs>	 (03Merged) 10jenkins-bot: Add namespace translations for Mandailing (btm) [core] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/941417 (https://phabricator.wikimedia.org/T335217) (owner: 10Zabe)
[15:08:07] <wikibugs>	 (03Merged) 10jenkins-bot: Add namespace translations for Mandailing (btm) [core] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941416 (https://phabricator.wikimedia.org/T335217) (owner: 10Zabe)
[15:08:39] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:941417|Add namespace translations for Mandailing (btm) (T335217)]], [[gerrit:941416|Add namespace translations for Mandailing (btm) (T335217)]]
[15:08:43] <stashbot>	 T335217: Add namespace translations in Mandailing - https://phabricator.wikimedia.org/T335217
[15:09:35] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[15:10:22] <logmsgbot>	 !log zabe@deploy1002 zabe: Backport for [[gerrit:941417|Add namespace translations for Mandailing (btm) (T335217)]], [[gerrit:941416|Add namespace translations for Mandailing (btm) (T335217)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[15:12:33] <wikibugs>	 (03PS1) 10Hnowlan: trafficserver: route requests to proton via rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/941440 (https://phabricator.wikimedia.org/T324678)
[15:13:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:14:51] <_joe_>	 !log removing all tags for docker image openjdk-8-jdk T341115
[15:14:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:55] <stashbot>	 T341115: Rationalize and update the use of base images in our docker-pkg repositories - https://phabricator.wikimedia.org/T341115
[15:16:31] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:941417|Add namespace translations for Mandailing (btm) (T335217)]], [[gerrit:941416|Add namespace translations for Mandailing (btm) (T335217)]] (duration: 07m 51s)
[15:16:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:16:35] <stashbot>	 T335217: Add namespace translations in Mandailing - https://phabricator.wikimedia.org/T335217
[15:17:05] <_joe_>	 !log removing all tags for docker image openjdk-8-jre T341115
[15:17:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:34] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] profile::logstash: allow more Istio ingress gateway logs [puppet] - 10https://gerrit.wikimedia.org/r/941434 (owner: 10Elukey)
[15:21:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:21:40] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[15:22:16] <wikibugs>	 (03PS2) 10Hashar: python-build: provide a python2 Bullseye image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/940161 (https://phabricator.wikimedia.org/T342346)
[15:22:58] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[15:23:47] <logmsgbot>	 !log dancy@deploy1002 Started deploy [releng/jenkins-deploy@97b4674] (releasing): (no justification provided)
[15:24:21] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[15:25:17] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[15:25:18] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs5004.eqsin.wmnet
[15:25:32] <logmsgbot>	 !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host lvs5004.eqsin.wmnet
[15:25:45] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs5004.eqsin.wmnet
[15:25:56] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (FY2022/2023-Q4): tcpircbot: enable logging to #wikimedia-cloud-feed - https://phabricator.wikimedia.org/T342666 (10fnegri)
[15:26:15] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[15:26:16] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (FY2022/2023-Q4): tcpircbot: enable logging to #wikimedia-cloud-feed - https://phabricator.wikimedia.org/T342666 (10fnegri) 05Open→03In progress
[15:26:24] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (FY2022/2023-Q4): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (10fnegri)
[15:28:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:28:45] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs5004.eqsin.wmnet
[15:29:10] <icinga-wm>	 PROBLEM - pybal on lvs5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[15:29:40] <wikibugs>	 (03PS1) 10FNegri: tcpircbot: add another port for cloud IRC logging [puppet] - 10https://gerrit.wikimedia.org/r/941441 (https://phabricator.wikimedia.org/T342666)
[15:30:10] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs5004 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[15:33:03] <wikibugs>	 (03PS3) 10Hashar: python-build: provide a python2 Bullseye image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/940161 (https://phabricator.wikimedia.org/T342346)
[15:33:05] <wikibugs>	 (03PS4) 10Hashar: python-build: set date of source files in the wheels [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/940157 (https://phabricator.wikimedia.org/T342346)
[15:33:07] <wikibugs>	 (03PS1) 10Hashar: Remove python3-build-jessie (Jessie is EOL) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/941442
[15:33:09] <wikibugs>	 (03PS1) 10Hashar: python-build: ensure frozen-requirements is exhaustive [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/941443 (https://phabricator.wikimedia.org/T342346)
[15:33:11] <wikibugs>	 (03PS1) 10Hashar: python-build: rebuild images for recent changes [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/941444 (https://phabricator.wikimedia.org/T342346)
[15:36:14] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: eqiad1: cloudnet: enable cloud-private subnet [puppet] - 10https://gerrit.wikimedia.org/r/941445 (https://phabricator.wikimedia.org/T342619)
[15:36:36] <icinga-wm>	 RECOVERY - pybal on lvs5004 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[15:36:38] <fabfur>	 !log lvs5004 restarted and services are reactivating (T335835)
[15:36:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:32] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs5004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:38:23] <wikibugs>	 (03PS1) 10Cwhite: logstash: reroute istio-ingressgateway logs to webrequest partition [puppet] - 10https://gerrit.wikimedia.org/r/941053
[15:38:56] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs5004 is OK: OK: 12 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[15:40:30] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Untested but LGTM, though I believe it might cause a restart of docker daemon due to systemd unit change. There's probably a way to craft " [puppet] - 10https://gerrit.wikimedia.org/r/941031 (owner: 10Andrew Bogott)
[15:41:20] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: reroute istio-ingressgateway logs to webrequest partition [puppet] - 10https://gerrit.wikimedia.org/r/941053 (owner: 10Cwhite)
[15:43:43] <wikibugs>	 (03PS1) 10Zabe: Initial configuration for btmwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941446 (https://phabricator.wikimedia.org/T335216)
[15:43:48] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC: as expected https://puppet-compiler.wmflabs.org/output/941445/42693/" [puppet] - 10https://gerrit.wikimedia.org/r/941445 (https://phabricator.wikimedia.org/T342619) (owner: 10Arturo Borrero Gonzalez)
[15:43:54] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+1] eqiad1: cloudnet: enable cloud-private subnet [puppet] - 10https://gerrit.wikimedia.org/r/941445 (https://phabricator.wikimedia.org/T342619) (owner: 10Arturo Borrero Gonzalez)
[15:46:02] <wikibugs>	 (03PS2) 10Zabe: Initial configuration for btmwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941446 (https://phabricator.wikimedia.org/T335216)
[15:50:37] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "oh damn, the what: section of the commit message describes the old approach" [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris)
[15:52:19] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: varnish: add requestctl to X-analytics for static actions too [puppet] - 10https://gerrit.wikimedia.org/r/941448 (https://phabricator.wikimedia.org/T342577)
[15:52:47] <wikibugs>	 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Remove unused plain HTTP services from LVS - https://phabricator.wikimedia.org/T236065 (10taavi)
[15:57:13] <logmsgbot>	 !log dancy@deploy1002 Finished deploy [releng/jenkins-deploy@97b4674] (releasing): (no justification provided) (duration: 33m 26s)
[15:57:52] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: httpbb: update ml-services tests [puppet] - 10https://gerrit.wikimedia.org/r/941449 (https://phabricator.wikimedia.org/T342266)
[15:58:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:00:04] <jouncebot>	 jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T1600).
[16:00:04] <jouncebot>	 James_F and dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:13] <dancy>	 o/
[16:00:25] <James_F>	 Heya.
[16:00:46] <James_F>	 I think mine were all deployed by now?
[16:01:00] <James_F>	 Oh, one of them was, one wasn't.
[16:01:43] <James_F>	 And for https://gerrit.wikimedia.org/r/c/operations/puppet/+/939757/ Alexandros said Traffic should merge, hmm.
[16:03:40] <rzl>	 taking a look 👋
[16:04:48] <rzl>	 James_F: hm, yeah, I could push the button on that if absolutely necessary but I'd be more comfortable having a trafficologist on hand
[16:05:03] <sukhe>	 I can look
[16:05:24] <rzl>	 amazing thank you
[16:05:48] <wikibugs>	 (03PS2) 10Ilias Sarantopoulos: ores-extension: enable lw on eswikiquotes and eswikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939697 (https://phabricator.wikimedia.org/T342115)
[16:06:35] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] Remove wikifunctions.org Varnish 302 [puppet] - 10https://gerrit.wikimedia.org/r/939757 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester)
[16:08:11] <sukhe>	 rzl: James_F: ok to merge then?
[16:08:39] <wikibugs>	 (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42694/console" [puppet] - 10https://gerrit.wikimedia.org/r/940406 (owner: 10Ahmon Dancy)
[16:08:39] <rzl>	 good by me!
[16:08:52] <rzl>	 once you're done I'll go ahead with dancy's two
[16:08:55] <sukhe>	 James_F: deploying, since you asked initially
[16:08:58] <sukhe>	 rzl: noed
[16:09:01] <sukhe>	 sigh, noted
[16:09:03] <sukhe>	 where are the t's
[16:09:13] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] Remove wikifunctions.org Varnish 302 [puppet] - 10https://gerrit.wikimedia.org/r/939757 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester)
[16:09:35] <rzl>	 los bu no forgoen 😔
[16:09:49] <sukhe>	 haha
[16:09:50] <dancy>	 haha
[16:11:44] <sukhe>	 all good, rolling out to the rest of the nodes
[16:12:29] <rzl>	 dancy: in the meantime, it doesn't look like there's any dependency or anything to test in between, I can just fire away with both, right?
[16:12:43] <dancy>	 Yes, they are unrelated changes.
[16:12:55] <rzl>	 👍
[16:13:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:14:07] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:15:56] <James_F>	 sukhe: Thanks!
[16:16:12] <sukhe>	 James_F: done
[16:16:39] <rzl>	 thanks sukhe <3 going ahead with the others now
[16:16:49] <sukhe>	 hanks
[16:16:53] <sukhe>	 :)
[16:16:59] <rzl>	 haha
[16:17:24] <wikibugs>	 (03CR) 10RLazarus: [V: 03+1 C: 03+2] Remove unreferenced hiera data [puppet] - 10https://gerrit.wikimedia.org/r/940406 (owner: 10Ahmon Dancy)
[16:18:10] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] Scap: scap_source Use the "group" consistently [puppet] - 10https://gerrit.wikimedia.org/r/361796 (https://phabricator.wikimedia.org/T342320) (owner: 10Thcipriani)
[16:18:18] <wikibugs>	 (03Abandoned) 10Jforrester: [WIP] service, k8s: Add service definitions for function-orchestrator and function-evaluator [puppet] - 10https://gerrit.wikimedia.org/r/938295 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester)
[16:19:07] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:19:12] <fabfur>	 !log begin rebooting lvs5005 (T335835)
[16:19:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:33] <wikibugs>	 (03PS1) 10Ssingh: dns6001: temporarily remove from authdns_servers for restart [puppet] - 10https://gerrit.wikimedia.org/r/941450
[16:21:16] <rzl>	 dancy: puppet's finished on deploy1002
[16:21:33] <dancy>	 Thanks!
[16:22:02] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] dns6001: temporarily remove from authdns_servers for restart [puppet] - 10https://gerrit.wikimedia.org/r/941450 (owner: 10Ssingh)
[16:24:00] <icinga-wm>	 PROBLEM - pybal on lvs5005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[16:24:04] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs5005 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[16:24:12] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs5005 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[16:24:29] <wikibugs>	 (03CR) 10BryanDavis: "Looks like it would work. Comments inline about how a one-time restart of existing services might be avoided." [puppet] - 10https://gerrit.wikimedia.org/r/941031 (owner: 10Andrew Bogott)
[16:25:46] <sukhe>	 DNS alerts in drmrs also expected
[16:25:54] <sukhe>	 er BGP alerts in drmrs because of DNS changes
[16:26:27] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] service::catalog: Add wikifunctions service [puppet] - 10https://gerrit.wikimedia.org/r/941313 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm)
[16:26:36] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host dns6001.wikimedia.org
[16:27:13] <wikibugs>	 (03PS1) 10Ssingh: Revert "dns6001: temporarily remove from authdns_servers for restart" [puppet] - 10https://gerrit.wikimedia.org/r/941420
[16:28:01] <wikibugs>	 10SRE, 10Traffic: Perform katran load tests on lvs1013 - https://phabricator.wikimedia.org/T342618 (10Vgutierrez)
[16:29:34] <icinga-wm>	 PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:29:42] <icinga-wm>	 PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:30:35] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns6001.wikimedia.org
[16:32:36] <icinga-wm>	 RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:32:44] <icinga-wm>	 RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:33:27] <wikibugs>	 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10RobH)
[16:33:44] <wikibugs>	 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10RobH)
[16:34:33] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] Revert "dns6001: temporarily remove from authdns_servers for restart" [puppet] - 10https://gerrit.wikimedia.org/r/941420 (owner: 10Ssingh)
[16:35:11] <wikibugs>	 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be100[34] - https://phabricator.wikimedia.org/T342675 (10RobH)
[16:35:27] <wikibugs>	 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be100[34] - https://phabricator.wikimedia.org/T342675 (10RobH)
[16:35:36] <wikibugs>	 (03CR) 10Andrew Bogott: docker service: support a list of arbitrary bind mounts (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/941031 (owner: 10Andrew Bogott)
[16:35:38] <wikibugs>	 (03PS2) 10Ayounsi: [WIP] Initial SONiC config from Homer YAML [homer/public] - 10https://gerrit.wikimedia.org/r/940867 (https://phabricator.wikimedia.org/T320638)
[16:35:48] <wikibugs>	 (03PS6) 10Andrew Bogott: docker service: support a list of arbitrary bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/941031
[16:35:50] <wikibugs>	 (03PS21) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (https://phabricator.wikimedia.org/T341640)
[16:36:01] <wikibugs>	 (03PS3) 10Ayounsi: [WIP] Initial SONiC config from Homer YAML [homer/public] - 10https://gerrit.wikimedia.org/r/940867 (https://phabricator.wikimedia.org/T320638)
[16:37:55] <wikibugs>	 (03CR) 10Andrew Bogott: "Here's pcc output showing this as no longer modifying the docker line:" [puppet] - 10https://gerrit.wikimedia.org/r/941031 (owner: 10Andrew Bogott)
[16:37:57] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs5005.eqsin.wmnet
[16:39:20] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+1] "Diff on existing usage looks good to me. https://puppet-compiler.wmflabs.org/output/941031/42697/cloudweb1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/941031 (owner: 10Andrew Bogott)
[16:39:54] <wikibugs>	 (03PS1) 10Ssingh: dns6002: temporarily remove from authdns_servers for restart [puppet] - 10https://gerrit.wikimedia.org/r/941452
[16:40:02] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-main-eqiad cluster: Roll restart of jvm daemons.
[16:40:21] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] images: fix debug logging for memcache [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/938272 (owner: 10Hnowlan)
[16:40:44] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] dns6002: temporarily remove from authdns_servers for restart [puppet] - 10https://gerrit.wikimedia.org/r/941452 (owner: 10Ssingh)
[16:41:07] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs5005.eqsin.wmnet
[16:41:35] <wikibugs>	 (03PS22) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (https://phabricator.wikimedia.org/T341640)
[16:41:37] <fabfur>	 !log end rebooting lvs5005 (T335835)
[16:41:37] <wikibugs>	 (03PS1) 10Andrew Bogott: DO NOT MERGE, this is just a proof of concept [puppet] - 10https://gerrit.wikimedia.org/r/941454
[16:41:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:50] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review, 10cloud-services-team (FY2022/2023-Q4): [spicerack] support including {project} in SAL messages - https://phabricator.wikimedia.org/T341793 (10fnegri) After discussing this with @Volans we think there's no need to modify the Spic...
[16:42:10] <icinga-wm>	 RECOVERY - pybal on lvs5005 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[16:42:14] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs5005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:43:51] <wikibugs>	 (03PS4) 10Ayounsi: [WIP] Initial SONiC config from Homer YAML [homer/public] - 10https://gerrit.wikimedia.org/r/940867 (https://phabricator.wikimedia.org/T320638)
[16:44:17] <wikibugs>	 (03PS2) 10Andrew Bogott: DO NOT MERGE, this is just a proof of concept [puppet] - 10https://gerrit.wikimedia.org/r/941454
[16:44:19] <wikibugs>	 (03PS23) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (https://phabricator.wikimedia.org/T341640)
[16:44:27] <wikibugs>	 (03Merged) 10jenkins-bot: images: fix debug logging for memcache [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/938272 (owner: 10Hnowlan)
[16:45:06] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host dns6002.wikimedia.org
[16:45:29] <wikibugs>	 (03PS4) 10Ayounsi: Initial OpenConfig/SONiC support to wmf-netbox [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/940515 (https://phabricator.wikimedia.org/T320638)
[16:46:02] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Initial OpenConfig/SONiC support to wmf-netbox [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/940515 (https://phabricator.wikimedia.org/T320638) (owner: 10Ayounsi)
[16:46:12] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs5005 is OK: OK: 4 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[16:46:46] <icinga-wm>	 PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:47:12] <wikibugs>	 (03PS1) 10Ssingh: Revert "dns6002: temporarily remove from authdns_servers for restart" [puppet] - 10https://gerrit.wikimedia.org/r/941421
[16:47:20] <icinga-wm>	 PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:47:52] <wikibugs>	 10SRE, 10Traffic: Perform katran load tests on lvs1013 - https://phabricator.wikimedia.org/T342618 (10Vgutierrez)
[16:49:05] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns6002.wikimedia.org
[16:49:44] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] Revert "dns6002: temporarily remove from authdns_servers for restart" [puppet] - 10https://gerrit.wikimedia.org/r/941421 (owner: 10Ssingh)
[16:50:07] <wikibugs>	 (03CR) 10Andrew Bogott: "Here's a diff where it actually adds something:" [puppet] - 10https://gerrit.wikimedia.org/r/941031 (owner: 10Andrew Bogott)
[16:50:36] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+1] Add perl536-sssd [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941401 (https://phabricator.wikimedia.org/T335507) (owner: 10Majavah)
[16:50:36] <wikibugs>	 (03PS1) 10Elukey: role::kafka::logging: apply threads settings to brokers [puppet] - 10https://gerrit.wikimedia.org/r/941455
[16:50:40] <wikibugs>	 (03PS24) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (https://phabricator.wikimedia.org/T341640)
[16:50:54] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (10Jhancock.wm) @aborrero I wanted to check with you about the cabling information on these servers. (planning out the racking ahe...
[16:51:14] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] Add perl536-sssd [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941401 (https://phabricator.wikimedia.org/T335507) (owner: 10Majavah)
[16:51:20] <icinga-wm>	 RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:51:22] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply
[16:51:36] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[16:51:46] <wikibugs>	 (03Merged) 10jenkins-bot: Add perl536-sssd [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/941401 (https://phabricator.wikimedia.org/T335507) (owner: 10Majavah)
[16:51:52] <icinga-wm>	 RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:52:06] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42703/console" [puppet] - 10https://gerrit.wikimedia.org/r/941455 (owner: 10Elukey)
[16:54:39] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "LGTM, especially if we get the nice improvement 😊" [puppet] - 10https://gerrit.wikimedia.org/r/941455 (owner: 10Elukey)
[16:56:36] <sukhe>	 !log dummy authdns-update
[16:56:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:22] <wikibugs>	 (03PS1) 10Dwisehaupt: Remove frav1002 for decom [dns] - 10https://gerrit.wikimedia.org/r/941457 (https://phabricator.wikimedia.org/T342678)
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T1700)
[17:07:59] <wikibugs>	 (03PS1) 10Majavah: conftool-data: Duplicate labweb service as cloudweb [puppet] - 10https://gerrit.wikimedia.org/r/941458 (https://phabricator.wikimedia.org/T317463)
[17:08:02] <wikibugs>	 (03PS1) 10Majavah: service: update labweb/cloudweb conftool pool name [puppet] - 10https://gerrit.wikimedia.org/r/941459 (https://phabricator.wikimedia.org/T317463)
[17:08:06] <wikibugs>	 (03PS1) 10Majavah: conftool-data: drop labweb pool [puppet] - 10https://gerrit.wikimedia.org/r/941460 (https://phabricator.wikimedia.org/T317463)
[17:09:00] <wikibugs>	 (03CR) 10Jgreen: [C: 03+2] Remove frav1002 for decom [dns] - 10https://gerrit.wikimedia.org/r/941457 (https://phabricator.wikimedia.org/T342678) (owner: 10Dwisehaupt)
[17:09:08] <wikibugs>	 (03CR) 10Jgreen: [C: 03+2] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/941457 (https://phabricator.wikimedia.org/T342678) (owner: 10Dwisehaupt)
[17:18:24] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] docker service: support a list of arbitrary bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/941031 (owner: 10Andrew Bogott)
[17:36:17] <wikibugs>	 (03PS1) 10Bernard Wang: Fix text showing on icon only buttons [skins/Vector] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941423
[17:39:22] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1 C: 03+2] Allow disabling puppet on reboot (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/939377 (https://phabricator.wikimedia.org/T342182) (owner: 10BCornwall)
[17:41:32] <wikibugs>	 (03PS1) 10Ladsgroup: beta: Stop writing to extlinks old columns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941463 (https://phabricator.wikimedia.org/T342683)
[17:42:32] <icinga-wm>	 RECOVERY - Check systemd state on mx1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:42:54] <wikibugs>	 (03PS1) 10Func: Revert "Adding Movepage-summary to wgForceUIMsgAsContentMsg to allow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941424 (https://phabricator.wikimedia.org/T183848)
[17:43:05] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "Adding Movepage-summary to wgForceUIMsgAsContentMsg to allow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941424 (https://phabricator.wikimedia.org/T183848) (owner: 10Func)
[17:43:18] <wikibugs>	 (03PS2) 10Func: Revert "Adding Movepage-summary to wgForceUIMsgAsContentMsg to allow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941424 (https://phabricator.wikimedia.org/T183848)
[17:44:22] <icinga-wm>	 RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:44:32] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] beta: Stop writing to extlinks old columns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941463 (https://phabricator.wikimedia.org/T342683) (owner: 10Ladsgroup)
[17:45:13] <wikibugs>	 (03PS1) 10Andrew Bogott: keystone.conf: turn on character restrictions for new projects and domains. [puppet] - 10https://gerrit.wikimedia.org/r/941464 (https://phabricator.wikimedia.org/T341509)
[17:45:16] <wikibugs>	 (03PS1) 10Andrew Bogott: keystone: hack to reject all non-alphanumerical project or domain names [puppet] - 10https://gerrit.wikimedia.org/r/941465 (https://phabricator.wikimedia.org/T341509)
[17:45:19] <wikibugs>	 (03Merged) 10jenkins-bot: beta: Stop writing to extlinks old columns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941463 (https://phabricator.wikimedia.org/T342683) (owner: 10Ladsgroup)
[17:45:22] <wikibugs>	 (03PS3) 10Func: Revert "Adding Movepage-summary to wgForceUIMsgAsContentMsg to allow" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/941424 (https://phabricator.wikimedia.org/T183848)
[17:45:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] keystone: hack to reject all non-alphanumerical project or domain names [puppet] - 10https://gerrit.wikimedia.org/r/941465 (https://phabricator.wikimedia.org/T341509) (owner: 10Andrew Bogott)
[17:47:34] <wikibugs>	 (03PS1) 10Ssingh: dns4004: temporarily remove from authdns_servers for restart [puppet] - 10https://gerrit.wikimedia.org/r/941466
[17:47:36] <wikibugs>	 (03PS1) 10Ssingh: Revert "dns4004: temporarily remove from authdns_servers for restart" [puppet] - 10https://gerrit.wikimedia.org/r/941467
[17:47:50] <wikibugs>	 (03PS2) 10Andrew Bogott: keystone: hack to reject all non-alphanumerical project or domain names [puppet] - 10https://gerrit.wikimedia.org/r/941465 (https://phabricator.wikimedia.org/T341509)
[17:48:37] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] dns4004: temporarily remove from authdns_servers for restart [puppet] - 10https://gerrit.wikimedia.org/r/941466 (owner: 10Ssingh)
[17:51:42] <wikibugs>	 (03PS3) 10Andrew Bogott: keystone: hack to reject all new non-alphanumerical project or domain names [puppet] - 10https://gerrit.wikimedia.org/r/941465 (https://phabricator.wikimedia.org/T341509)
[17:51:50] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:51:56] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:51:57] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host dns4004.wikimedia.org
[17:53:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:55:22] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:55:34] <icinga-wm>	 PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:56:50] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns4004.wikimedia.org
[17:58:00] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] Revert "dns4004: temporarily remove from authdns_servers for restart" [puppet] - 10https://gerrit.wikimedia.org/r/941467 (owner: 10Ssingh)
[17:58:25] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:58:32] <icinga-wm>	 RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:59:54] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:00:05] <jouncebot>	 jnuche and dancy: OwO what's this, a deployment window?? MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T1800). nyaa~
[18:00:51] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!!" [puppet] - 10https://gerrit.wikimedia.org/r/941455 (owner: 10Elukey)
[18:01:45] <wikibugs>	 (03PS1) 10Ssingh: dns4003: temporarily remove from authdns_servers for restart [puppet] - 10https://gerrit.wikimedia.org/r/941469
[18:01:47] <wikibugs>	 (03PS1) 10Ssingh: Revert "dns4003: temporarily remove from authdns_servers for restart" [puppet] - 10https://gerrit.wikimedia.org/r/941470
[18:02:16] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] dns4003: temporarily remove from authdns_servers for restart [puppet] - 10https://gerrit.wikimedia.org/r/941469 (owner: 10Ssingh)
[18:03:48] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:03:52] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:05:08] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50276 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:05:14] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.308 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:06:20] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host dns4003.wikimedia.org
[18:06:54] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:07:02] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:08:58] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:09:08] <icinga-wm>	 PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:11:41] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns4003.wikimedia.org
[18:13:26] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] Revert "dns4003: temporarily remove from authdns_servers for restart" [puppet] - 10https://gerrit.wikimedia.org/r/941470 (owner: 10Ssingh)
[18:13:30] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:13:40] <icinga-wm>	 RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:21:01] <sukhe>	 !log dummy authdns-update returns
[18:21:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:07] <logmsgbot>	 !log dwisehaupt@cumin1001 START - Cookbook sre.dns.netbox
[18:21:27] <sukhe>	 oh interesting time
[18:21:29] <sukhe>	 we will see how this plays out
[18:21:36] <dwisehaupt>	 oops. :)
[18:21:41] <sukhe>	 haha all good, my bad too :)
[18:21:55] <dwisehaupt>	 mine is just a decommissioning for frav1002
[18:21:56] <sukhe>	 I was just doing a dummy run to make sure everything is fine with all hosts
[18:22:04] <sukhe>	 dwisehaupt: all yours
[18:22:09] <sukhe>	 please feel free to run it
[18:22:14] <sukhe>	 if you see any issues, please ping
[18:22:24] <dwisehaupt>	 will do, it should be done soonish.
[18:23:26] <logmsgbot>	 !log dwisehaupt@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: * - dwisehaupt@cumin1001"
[18:24:13] <logmsgbot>	 !log dwisehaupt@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: * - dwisehaupt@cumin1001"
[18:24:13] <logmsgbot>	 !log dwisehaupt@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:24:42] <dwisehaupt>	 sukhe: all good and clean.
[18:24:46] <sukhe>	 dwisehaupt: thanks!
[18:32:12] <wikibugs>	 10ops-eqiad, 10decommission-hardware, 10fundraising-tech-ops: decommission frav1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T342678 (10Dwisehaupt) a:05Dwisehaupt→03Jclark-ctr Host powered off and ready for decom.
[18:32:34] <wikibugs>	 10ops-eqiad, 10decommission-hardware, 10fundraising-tech-ops: decommission frav1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T342678 (10Dwisehaupt)
[18:52:26] <wikibugs>	 (03PS1) 10Krinkle: [BETA HACK] Make kafka_config default cluster logic actually work [puppet] - 10https://gerrit.wikimedia.org/r/941475
[18:52:28] <wikibugs>	 (03PS1) 10Krinkle: [BETA HACK] Attempt to secure Puppet DB better [puppet] - 10https://gerrit.wikimedia.org/r/941476
[18:52:30] <wikibugs>	 (03PS1) 10Krinkle: [BETA HACK] Allow external access from anywhere to parsoid port 80 for CI purposes [puppet] - 10https://gerrit.wikimedia.org/r/941477
[18:52:32] <wikibugs>	 (03PS1) 10Krinkle: [BETA HACK] confd: Fix confd hostname [puppet] - 10https://gerrit.wikimedia.org/r/941478
[18:52:34] <wikibugs>	 (03PS1) 10Krinkle: [BETA HACK] scap: foreachwikiindblist: always filter for all-labs [puppet] - 10https://gerrit.wikimedia.org/r/941479
[18:53:06] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [BETA HACK] Attempt to secure Puppet DB better [puppet] - 10https://gerrit.wikimedia.org/r/941476 (owner: 10Krinkle)
[18:55:29] <wikibugs>	 (03PS1) 10Dwisehaupt: Remove frmon1001 and frmon2001 from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/941480 (https://phabricator.wikimedia.org/T342693)
[18:55:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [BETA HACK] Allow external access from anywhere to parsoid port 80 for CI purposes [puppet] - 10https://gerrit.wikimedia.org/r/941477 (owner: 10Krinkle)
[18:57:05] <wikibugs>	 (03CR) 10Ssingh: "recheck" [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/941367 (https://phabricator.wikimedia.org/T342154) (owner: 10Fabfur)
[18:59:01] <wikibugs>	 (03CR) 10Bking: [C: 03+2] Bump version of extra plugin [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/938210 (https://phabricator.wikimedia.org/T325315) (owner: 10Peter Fischer)
[19:06:50] <wikibugs>	 (03PS1) 10Bking: Increment BUILD_VERSION so plugin can build [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/941483 (https://phabricator.wikimedia.org/T325315)
[19:09:13] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] Increment BUILD_VERSION so plugin can build [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/941483 (https://phabricator.wikimedia.org/T325315) (owner: 10Bking)
[19:09:34] <wikibugs>	 (03CR) 10Bking: [C: 03+2] Increment BUILD_VERSION so plugin can build [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/941483 (https://phabricator.wikimedia.org/T325315) (owner: 10Bking)
[19:12:01] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Add Dell switches support to Homer/Cookbooks - https://phabricator.wikimedia.org/T320638 (10ayounsi)
[19:12:18] <wikibugs>	 (03PS2) 10Ayounsi: WIP: first scaffolding fo gNMI support [software/homer] - 10https://gerrit.wikimedia.org/r/939681 (https://phabricator.wikimedia.org/T320638)
[19:12:33] <wikibugs>	 (03PS3) 10Ayounsi: WIP: first scaffolding for gNMI support [software/homer] - 10https://gerrit.wikimedia.org/r/939681 (https://phabricator.wikimedia.org/T320638)
[19:13:08] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] keystone.conf: turn on character restrictions for new projects and domains. [puppet] - 10https://gerrit.wikimedia.org/r/941464 (https://phabricator.wikimedia.org/T341509) (owner: 10Andrew Bogott)
[19:13:18] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] keystone: hack to reject all new non-alphanumerical project or domain names [puppet] - 10https://gerrit.wikimedia.org/r/941465 (https://phabricator.wikimedia.org/T341509) (owner: 10Andrew Bogott)
[19:13:50] <wikibugs>	 (03PS10) 10Jforrester: [DNM] Add wikifunctions.org to prod wgLocalVirtualHosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771623 (https://phabricator.wikimedia.org/T275945)
[19:13:52] <wikibugs>	 (03PS13) 10Jforrester: [DNM] Initial configuration for Wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945)
[19:13:54] <wikibugs>	 (03CR) 10Jforrester: [DNM] Initial configuration for Wikifunctions.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) (owner: 10Jforrester)
[19:13:56] <wikibugs>	 (03PS7) 10Jforrester: Add wikifunctions.org to foundationwiki's custom CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771624
[19:13:58] <wikibugs>	 (03PS5) 10Jforrester: [Beta Cluster] Drop duplicate settings now Wikifunctions.org exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934632
[19:14:00] <wikibugs>	 (03PS11) 10Jforrester: Let wikifunctions.org use the Graph system [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740795
[19:14:18] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP: first scaffolding for gNMI support [software/homer] - 10https://gerrit.wikimedia.org/r/939681 (https://phabricator.wikimedia.org/T320638) (owner: 10Ayounsi)
[19:15:36] <wikibugs>	 (03PS1) 10Andrew Bogott: keystone: fix name of patch file [puppet] - 10https://gerrit.wikimedia.org/r/941484
[19:15:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] keystone: fix name of patch file [puppet] - 10https://gerrit.wikimedia.org/r/941484 (owner: 10Andrew Bogott)
[19:16:18] <wikibugs>	 (03PS2) 10Andrew Bogott: keystone: fix name of patch file [puppet] - 10https://gerrit.wikimedia.org/r/941484
[19:17:49] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] keystone: fix name of patch file [puppet] - 10https://gerrit.wikimedia.org/r/941484 (owner: 10Andrew Bogott)
[19:26:27] <wikibugs>	 (03CR) 10Jgreen: [C: 03+1] "Looks good to me, ready for SRE to merge & deploy!" [puppet] - 10https://gerrit.wikimedia.org/r/941480 (https://phabricator.wikimedia.org/T342693) (owner: 10Dwisehaupt)
[19:34:53] <wikibugs>	 (03PS25) 10Andrew Bogott: Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (https://phabricator.wikimedia.org/T341640)
[19:37:20] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] "Can this please be deployed so we can call it? :-) We're trying to go live tomorrow at 16:00 UTC, so it'd be great to have this in place a" [puppet] - 10https://gerrit.wikimedia.org/r/941313 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm)
[19:38:38] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Horizon: add docker_deploy profile [puppet] - 10https://gerrit.wikimedia.org/r/940992 (https://phabricator.wikimedia.org/T341640) (owner: 10Andrew Bogott)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230725T2000).
[20:00:05] <jouncebot>	 bwang: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:21] <kimberly_sarabia>	 Hello. I'm subbing for bwang
[20:01:24] <TheresNoTime>	 I cannot deploy this evening
[20:01:29] <kimberly_sarabia>	 tyty
[20:01:46] <taavi>	 I can deploy
[20:02:08] <kimberly_sarabia>	 taavi: thanks
[20:03:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [skins/Vector] (wmf/1.41.0-wmf.19) - 10https://gerrit.wikimedia.org/r/941423 (owner: 10Bernard Wang)
[20:04:33] <taavi>	 and now we wait for CI
[20:10:02] <wikibugs>	 (PS1) Andrew Bogott: horizon/docker: fix (maybe) the namespace and image name [puppet] - https://gerrit.wikimedia.org/r/941512
[20:15:48] <wikibugs>	 (PS1) Tsevener: Add stream config for iOS schema [mediawiki-config] - https://gerrit.wikimedia.org/r/941514 (https://phabricator.wikimedia.org/T341896)
[20:18:12] <wikibugs>	 (PS14) Jforrester: [DNM] Initial configuration for Wikifunctions.org [mediawiki-config] - https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945)
[20:18:14] <wikibugs>	 (PS8) Jforrester: Add wikifunctions.org to foundationwiki's custom CSP [mediawiki-config] - https://gerrit.wikimedia.org/r/771624
[20:18:16] <wikibugs>	 (PS6) Jforrester: [Beta Cluster] Drop duplicate settings now Wikifunctions.org exists [mediawiki-config] - https://gerrit.wikimedia.org/r/934632
[20:18:18] <wikibugs>	 (PS12) Jforrester: Let wikifunctions.org use the Graph system [mediawiki-config] - https://gerrit.wikimedia.org/r/740795
[20:18:20] <wikibugs>	 (PS1) Jforrester: [DNM] Move wikifunctions.org from locked-down to limited deployment [mediawiki-config] - https://gerrit.wikimedia.org/r/941515
[20:21:14] <wikibugs>	 (Merged) jenkins-bot: Fix text showing on icon only buttons [skins/Vector] (wmf/1.41.0-wmf.19) - https://gerrit.wikimedia.org/r/941423 (owner: Bernard Wang)
[20:21:42] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:941423|Fix text showing on icon only buttons]]
[20:23:17] <logmsgbot>	 !log taavi@deploy1002 taavi and bwang: Backport for [[gerrit:941423|Fix text showing on icon only buttons]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:23:19] <wikibugs>	 (CR) Andrew Bogott: [C: +2] horizon/docker: fix (maybe) the namespace and image name [puppet] - https://gerrit.wikimedia.org/r/941512 (owner: Andrew Bogott)
[20:23:24] <taavi>	 kimberly_sarabia: please test
[20:23:40] <kimberly_sarabia>	 taavi: ack
[20:26:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:27:31] <kimberly_sarabia>	 taavi: LGTM
[20:27:46] <taavi>	 thanks, syncing
[20:27:55] <wikibugs>	 (PS3) Zabe: Initial configuration for btmwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/941446 (https://phabricator.wikimedia.org/T335216)
[20:31:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:32:46] <wikibugs>	 (PS1) Andrew Bogott: Horizon/docker: fix bind mount typo [puppet] - https://gerrit.wikimedia.org/r/941518 (https://phabricator.wikimedia.org/T341640)
[20:33:50] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:941423|Fix text showing on icon only buttons]] (duration: 12m 08s)
[20:33:55] <taavi>	 all done!
[20:34:07] <taavi>	 anyone have anything else to deploy?
[20:35:32] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Horizon/docker: fix bind mount typo [puppet] - https://gerrit.wikimedia.org/r/941518 (https://phabricator.wikimedia.org/T341640) (owner: Andrew Bogott)
[20:40:05] <zabe>	 o/
[20:40:21] <wikibugs>	 (CR) Zabe: [C: +2] Initial configuration for btmwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/941446 (https://phabricator.wikimedia.org/T335216) (owner: Zabe)
[20:41:01] <wikibugs>	 (Merged) jenkins-bot: Initial configuration for btmwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/941446 (https://phabricator.wikimedia.org/T335216) (owner: Zabe)
[20:42:55] <zabe>	 !log create Wiktionary Mandailing # T335216
[20:42:58] <Amir1>	 zabe: <3
[20:42:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:59] <stashbot>	 T335216: Create Wiktionary Mandailing - https://phabricator.wikimedia.org/T335216
[20:43:08] <Amir1>	 I'm around, let me know if things go weeeeee
[20:43:17] <zabe>	 addwiki ran through without errors
[20:43:30] <Amir1>	 Wohoooo
[20:43:56] <taavi>	 just in time for wikifunctions :D
[20:44:47] <logmsgbot>	 !log zabe@deploy1002 Started scap: T335216
[20:46:25] <logmsgbot>	 !log zabe@deploy1002 zabe: T335216 synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:48:23] <wikibugs>	 (PS1) Zabe: Create UserIdentityValue with correct wiki [extensions/CheckUser] (wmf/1.41.0-wmf.19) - https://gerrit.wikimedia.org/r/941498 (https://phabricator.wikimedia.org/T342655)
[20:48:33] <wikibugs>	 (PS1) Zabe: Create UserIdentityValue with correct wiki [extensions/CheckUser] (wmf/1.41.0-wmf.18) - https://gerrit.wikimedia.org/r/941499 (https://phabricator.wikimedia.org/T342655)
[20:48:46] <wikibugs>	 (CR) Zabe: [C: +2] Create UserIdentityValue with correct wiki [extensions/CheckUser] (wmf/1.41.0-wmf.18) - https://gerrit.wikimedia.org/r/941499 (https://phabricator.wikimedia.org/T342655) (owner: Zabe)
[20:48:54] <wikibugs>	 (CR) Zabe: [C: +2] Create UserIdentityValue with correct wiki [extensions/CheckUser] (wmf/1.41.0-wmf.19) - https://gerrit.wikimedia.org/r/941498 (https://phabricator.wikimedia.org/T342655) (owner: Zabe)
[20:50:55] <James_F>	 zabe:  Thank you!
[20:51:12] <James_F>	 Of course, we need the service to go live first. :-)
[20:52:28] <zabe>	 yw :)
[20:52:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:53:11] <logmsgbot>	 !log zabe@deploy1002 Finished scap: T335216 (duration: 08m 24s)
[20:53:16] <stashbot>	 T335216: Create Wiktionary Mandailing - https://phabricator.wikimedia.org/T335216
[20:54:36] <wikibugs>	 (PS1) Zabe: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/941057
[20:54:38] <wikibugs>	 (CR) Zabe: [C: +2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/941057 (owner: Zabe)
[20:55:12] <logmsgbot>	 !log zabe@deploy1002 Started scap: update interwiki cache, [[gerrit:941057]]
[20:55:32] <wikibugs>	 (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/941057 (owner: Zabe)
[20:57:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:58:13] <wikibugs>	 (PS1) Andrew Bogott: horizon/docker: move to port 8084 [puppet] - https://gerrit.wikimedia.org/r/941521 (https://phabricator.wikimedia.org/T341640)
[21:01:44] <wikibugs>	 (CR) Andrew Bogott: [C: +2] horizon/docker: move to port 8084 [puppet] - https://gerrit.wikimedia.org/r/941521 (https://phabricator.wikimedia.org/T341640) (owner: Andrew Bogott)
[21:02:33] <logmsgbot>	 !log zabe@deploy1002 Finished scap: update interwiki cache, [[gerrit:941057]] (duration: 07m 20s)
[21:02:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:05:49] <wikibugs>	 (Merged) jenkins-bot: Create UserIdentityValue with correct wiki [extensions/CheckUser] (wmf/1.41.0-wmf.18) - https://gerrit.wikimedia.org/r/941499 (https://phabricator.wikimedia.org/T342655) (owner: Zabe)
[21:07:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:07:46] <wikibugs>	 (Merged) jenkins-bot: Create UserIdentityValue with correct wiki [extensions/CheckUser] (wmf/1.41.0-wmf.19) - https://gerrit.wikimedia.org/r/941498 (https://phabricator.wikimedia.org/T342655) (owner: Zabe)
[21:08:19] <logmsgbot>	 !log zabe@deploy1002 Started scap: Backport for [[gerrit:941498|Create UserIdentityValue with correct wiki (T342655)]], [[gerrit:941499|Create UserIdentityValue with correct wiki (T342655)]]
[21:08:23] <stashbot>	 T342655: Special:Investigate: Wikimedia\Assert\PreconditionException: Expected MediaWiki\User\UserIdentityValue to belong to 'afwiki', but it belongs to the local wiki - https://phabricator.wikimedia.org/T342655
[21:10:01] <logmsgbot>	 !log zabe@deploy1002 zabe: Backport for [[gerrit:941498|Create UserIdentityValue with correct wiki (T342655)]], [[gerrit:941499|Create UserIdentityValue with correct wiki (T342655)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[21:18:26] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Backport for [[gerrit:941498|Create UserIdentityValue with correct wiki (T342655)]], [[gerrit:941499|Create UserIdentityValue with correct wiki (T342655)]] (duration: 10m 06s)
[21:18:30] <stashbot>	 T342655: Special:Investigate: Wikimedia\Assert\PreconditionException: Expected MediaWiki\User\UserIdentityValue to belong to 'afwiki', but it belongs to the local wiki - https://phabricator.wikimedia.org/T342655
[21:32:49] <wikibugs>	 (PS1) Andrew Bogott: horizon: use the in-container path for static resources in codfw [puppet] - https://gerrit.wikimedia.org/r/941527 (https://phabricator.wikimedia.org/T341640)
[21:35:06] <wikibugs>	 (CR) Andrew Bogott: [C: +2] horizon: use the in-container path for static resources in codfw [puppet] - https://gerrit.wikimedia.org/r/941527 (https://phabricator.wikimedia.org/T341640) (owner: Andrew Bogott)
[21:43:29] <wikibugs>	 (PS1) Andrew Bogott: Horizon/docker: another move from 8081 to 8084 [puppet] - https://gerrit.wikimedia.org/r/941528 (https://phabricator.wikimedia.org/T341640)
[21:46:18] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Horizon/docker: another move from 8081 to 8084 [puppet] - https://gerrit.wikimedia.org/r/941528 (https://phabricator.wikimedia.org/T341640) (owner: Andrew Bogott)
[21:47:04] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, thanks for adding the test too!" [software/pywmflib] - https://gerrit.wikimedia.org/r/940968 (https://phabricator.wikimedia.org/T341793) (owner: FNegri)
[21:51:59] <wikibugs>	 (CR) Volans: [C: -1] "Minor errors inside, LGTM otherwise." [puppet] - https://gerrit.wikimedia.org/r/941441 (https://phabricator.wikimedia.org/T342666) (owner: FNegri)
[21:54:26] <wikibugs>	 (PS3) Bking: flink-zk: Initiate new flink::zookeeper role [puppet] - https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792)
[21:54:51] <wikibugs>	 (CR) CI reject: [V: -1] flink-zk: Initiate new flink::zookeeper role [puppet] - https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: Bking)
[21:55:50] <wikibugs>	 (PS1) Andrew Bogott: Horizon/docker: yet more moves from 8081 to 8084 [puppet] - https://gerrit.wikimedia.org/r/941529 (https://phabricator.wikimedia.org/T341640)
[21:56:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:56:35] <wikibugs>	 (CR) Volans: "Reply inline" [cookbooks] - https://gerrit.wikimedia.org/r/939377 (https://phabricator.wikimedia.org/T342182) (owner: BCornwall)
[21:57:40] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Horizon/docker: yet more moves from 8081 to 8084 [puppet] - https://gerrit.wikimedia.org/r/941529 (https://phabricator.wikimedia.org/T341640) (owner: Andrew Bogott)
[21:59:54] <wikibugs>	 (PS4) Bking: flink-zk: Initiate new flink::zookeeper role [puppet] - https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792)
[22:00:18] <wikibugs>	 (CR) CI reject: [V: -1] flink-zk: Initiate new flink::zookeeper role [puppet] - https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: Bking)
[22:01:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:08:28] <wikibugs>	 (PS5) Bking: flink-zk: Initiate new flink::zookeeper role [puppet] - https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792)
[22:08:52] <wikibugs>	 (CR) CI reject: [V: -1] flink-zk: Initiate new flink::zookeeper role [puppet] - https://gerrit.wikimedia.org/r/940243 (https://phabricator.wikimedia.org/T341792) (owner: Bking)
[22:21:19] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: remove haproxy log cloning [puppet] - https://gerrit.wikimedia.org/r/937601 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[22:23:01] <wikibugs>	 (CR) Cwhite: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/940879 (https://phabricator.wikimedia.org/T108027) (owner: Filippo Giunchedi)
[23:15:30] <wikibugs>	 (PS1) Ladsgroup: Replace the look with Wikimedia UI [software/bitu] - https://gerrit.wikimedia.org/r/941535
[23:27:16] <wikibugs>	 (PS2) Ladsgroup: Replace the look with Wikimedia UI [software/bitu] - https://gerrit.wikimedia.org/r/941535
[23:46:44] <icinga-wm>	 PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: adds-changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state