[00:17:25] <wikibugs>	 10SRE, 10fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (10ssingh)
[00:18:59] <urandom>	 !log rebooting Cassandra on sessionstore1001 — T327954
[00:19:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:03] <stashbot>	 T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954
[00:24:02] <icinga-wm_>	 PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2023-03-28 00:00:09 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:28:00] <wikibugs>	 (03PS3) 10Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed [puppet] - 10https://gerrit.wikimedia.org/r/905243
[00:28:58] <wikibugs>	 (03CR) 10Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe)
[00:29:38] <wikibugs>	 (03CR) 10Raymond Ndibe: maintain-dbusers: ensure get_global_wiki_user is only called when needed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe)
[00:39:30] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/905560
[00:39:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/905560 (owner: 10TrainBranchBot)
[00:50:07] <urandom>	 !log rebooting  sessionstore1001 — T327954
[00:50:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:50:12] <stashbot>	 T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954
[00:52:06] <wikibugs>	 (03PS1) 10Kevin Bazira: ml-services: Add trwiki editquality isvcs to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/905561 (https://phabricator.wikimedia.org/T334158)
[00:56:35] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/905560 (owner: 10TrainBranchBot)
[01:41:53] <jinxer-wm>	 (KubernetesCalicoDown) firing: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[02:06:32] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:06:47] <logmsgbot>	 !log fab@deploy2002 Started deploy [airflow-dags/research@2192f15]: (no justification provided)
[02:07:09] <logmsgbot>	 !log fab@deploy2002 Finished deploy [airflow-dags/research@2192f15]: (no justification provided) (duration: 00m 21s)
[02:26:32] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:34:30] <wikibugs>	 (03PS1) 10KartikMistry: Enable  Section Translation on Kashmiri Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906137 (https://phabricator.wikimedia.org/T326541)
[05:05:42] <icinga-wm_>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:05:44] <icinga-wm_>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:06:02] <icinga-wm_>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:17:22] <icinga-wm_>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:17:24] <icinga-wm_>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:17:42] <icinga-wm_>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:25:42] <icinga-wm_>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:25:46] <icinga-wm_>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:26:04] <icinga-wm_>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:37:46] <icinga-wm_>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:39:04] <icinga-wm_>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:39:10] <icinga-wm_>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:41:53] <jinxer-wm>	 (KubernetesCalicoDown) firing: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[06:00:06] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T0600)
[06:00:06] <jouncebot>	 kormat, marostegui, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T0600).
[06:17:44] <wikibugs>	 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) I don't know what has happened over the night  but the zuul-merger service started alarming over night: `   Notificatio...
[06:19:05] <wikibugs>	 (03PS1) 10Hashar: zuul: disable monitoring for disabled merger service [puppet] - 10https://gerrit.wikimedia.org/r/906307 (https://phabricator.wikimedia.org/T324659)
[06:19:47] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/906307 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar)
[06:27:10] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073 (10hashar)
[06:27:54] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail, 10Wikimedia-Mailing-lists, 10Wikimedia-Incident: Mailman hasn't delivered emails since 2023-03-07 14 UTC (was: reviewer-bot is not working) - https://phabricator.wikimedia.org/T331626 (10hashar) 05Open→03Resolved That one has been solved after I have found...
[06:58:04] <wikibugs>	 (03CR) 10Hashar: "Puppet compiler https://puppet-compiler.wmflabs.org/output/906307/1701/" [puppet] - 10https://gerrit.wikimedia.org/r/906307 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar)
[06:58:53] <hashar>	 if someone could please puppet-merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/906307/ , that will remove an erroneous alarm for the contint2002 zuul-merger service, which is intentionally disabled but still has monitoring enabled :)
[07:00:06] <jouncebot>	 Amir1, apergos, and jnuche: May I have your attention please! UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T0700)
[07:00:06] <jouncebot>	 kart_: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:21] <hashar>	 I should ask in sre ;)
[07:00:22] <apergos>	 morning!  there are no trainees signed up for the window and just the one patch scheduled. kart is apparently not here yet; let's see if they want to self-deploy as usual when they do arrive.
[07:03:55] <kart_>	 ah. Sorry for late joining.
[07:03:57] <apergos>	 welcome kart_ !  are you self-deploying today?
[07:04:04] <apergos>	 no trainees signed up so ....
[07:04:26] <kart_>	 apergos: Yes :)
[07:04:41] <hashar>	 I am taking a break, will show up for the mediawiki train in an hour
[07:04:42] <apergos>	 great! go ahead when ready
[07:05:51] <kart_>	 (I realized that I've put wrong Gerrit link. Fixed it)
[07:06:20] <apergos>	 oh!  looking now
[07:06:50] <apergos>	 yes ok, that seems like a much smaller change for a backport window :-D
[07:07:32] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906137 (https://phabricator.wikimedia.org/T326541) (owner: 10KartikMistry)
[07:07:41] <kart_>	 apergos: :D
[07:08:21] <wikibugs>	 (03Merged) 10jenkins-bot: Enable  Section Translation on Kashmiri Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906137 (https://phabricator.wikimedia.org/T326541) (owner: 10KartikMistry)
[07:09:43] <logmsgbot>	 !log kartik@deploy2002 Started scap: Backport for [[gerrit:906137|Enable  Section Translation on Kashmiri Wikipedia (T326541)]]
[07:09:47] <stashbot>	 T326541: Enable Section Translation on Kashmiri Wikipedia - https://phabricator.wikimedia.org/T326541
[07:10:41] <wikibugs>	 10SRE, 10Traffic: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (10elukey) >>! In T334078#8759196, @Ottomata wrote: > From a brief glance, those look like normal consumer reassignment messages.  Probably shouldn't be alerts.  @Ottomata I thought so yes, but I got a...
[07:11:13] <logmsgbot>	 !log kartik@deploy2002 kartik: Backport for [[gerrit:906137|Enable  Section Translation on Kashmiri Wikipedia (T326541)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[07:16:53] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Traffic: Abstract LVS restart using cookbook - https://phabricator.wikimedia.org/T334166 (10ayounsi) p:05Triage→03Low
[07:16:56] <zabe>	 !log zabe@mwmaint2002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Abuse filter maintainer" "Abuse filter maintainers" "Zabe" --reason "per request [[:phab:T334147|T334147]]"
[07:17:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:01] <stashbot>	 T334147: Request to move translatable page: Abuse filter maintainer - https://phabricator.wikimedia.org/T334147
[07:19:15] <logmsgbot>	 !log kartik@deploy2002 Finished scap: Backport for [[gerrit:906137|Enable  Section Translation on Kashmiri Wikipedia (T326541)]] (duration: 09m 31s)
[07:19:18] <stashbot>	 T326541: Enable Section Translation on Kashmiri Wikipedia - https://phabricator.wikimedia.org/T326541
[07:25:20] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] zuul: disable monitoring for disabled merger service [puppet] - 10https://gerrit.wikimedia.org/r/906307 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar)
[07:28:22] <moritzm>	 !log installing ghostscript security updates
[07:28:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:20] <kart_>	 apergos: I'm done. Sorry for the slightly late reply.
[07:31:15] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "Deploy anytime! :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/905561 (https://phabricator.wikimedia.org/T334158) (owner: 10Kevin Bazira)
[07:31:32] <apergos>	 no worries, as long as you don't forget completely :-D
[07:31:48] <apergos>	 !log UTC morning backport and config training window done
[07:31:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:06] <kart_>	 apergos: :)
[07:39:54] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877 (10ayounsi) Thanks for the feedback!  > Weighing this against the costs of maintaining them properly, that's the big question here.  Indeed :)  I opened...
[07:47:56] <wikibugs>	 (03PS5) 10Jelto: install_server: simplify gitlab disk layout, drop lvm, use four SSDs [puppet] - 10https://gerrit.wikimedia.org/r/906030 (https://phabricator.wikimedia.org/T330172)
[07:53:57] <wikibugs>	 (03PS1) 10Cathal Mooney: Add EVPN protocol config for enabled L3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/906529 (https://phabricator.wikimedia.org/T327934)
[07:53:58] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Automate EVPN switch underlay BGP neighbor peerings - https://phabricator.wikimedia.org/T327934 (10cmooney) 05Resolved→03Open Re-opening as there are some EVPN elements outside the 'protocols bgp' context that also need to be added.  Will submit patch.
[07:56:25] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:56:58] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-main-codfw cluster: Roll restart of jvm daemons.
[07:58:32] <wikibugs>	 (03PS6) 10Jelto: install_server: simplify gitlab disk layout, drop lvm, use four SSDs [puppet] - 10https://gerrit.wikimedia.org/r/906030 (https://phabricator.wikimedia.org/T330172)
[08:00:06] <jouncebot>	 hashar and dduvall: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T0800)
[08:00:16] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40551/console" [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey)
[08:00:27] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.232 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:01:52] <wikibugs>	 (03PS1) 10TrainBranchBot: all wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906530 (https://phabricator.wikimedia.org/T330209)
[08:01:54] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906530 (https://phabricator.wikimedia.org/T330209) (owner: 10TrainBranchBot)
[08:02:35] <wikibugs>	 (03Merged) 10jenkins-bot: all wikis to 1.41.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906530 (https://phabricator.wikimedia.org/T330209) (owner: 10TrainBranchBot)
[08:03:05] * Lucas_WMDE waves farewell to IE11
[08:05:13] <icinga-wm_>	 PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:05:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to the extent a Partman recipe can look good" [puppet] - 10https://gerrit.wikimedia.org/r/906030 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto)
[08:08:25] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Remove static routes for LVS VIPs from core routers - https://phabricator.wikimedia.org/T300877 (10cmooney) That codfw error is interesting actually, it makes me wonder why we have the "no-resolve" command on those routes?  Without that the error wo...
[08:08:51] <volans>	 !log restarting update-ubuntu-mirror.service on mirror1001 to check if it was a transient error
[08:08:53] <icinga-wm_>	 RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:08:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:01] <logmsgbot>	 !log hashar@deploy2002 rebuilt and synchronized wikiversions files: all wikis to 1.41.0-wmf.3  refs T330209
[08:09:05] <stashbot>	 T330209: 1.41.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T330209
[08:09:39] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] "thanks for the review! I'll test and re-image gitlab2003 with the new partman config" [puppet] - 10https://gerrit.wikimedia.org/r/906030 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto)
[08:10:01] <wikibugs>	 (03CR) 10David Caro: kubernetes: set NO_HOME for bulidservice (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/901129 (owner: 10David Caro)
[08:10:10] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] sre: mute etcd-mirror pint promql checks [alerts] - 10https://gerrit.wikimedia.org/r/906011 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi)
[08:10:32] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: move varnishkafka-exporter stats to counters [puppet] - 10https://gerrit.wikimedia.org/r/906000 (https://phabricator.wikimedia.org/T334085) (owner: 10Filippo Giunchedi)
[08:10:39] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] install_server: simplify gitlab disk layout, drop lvm, use four SSDs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/906030 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto)
[08:10:55] <godog>	 jelto: I merged you change too
[08:11:00] <godog>	 your change even
[08:11:08] <jelto>	 godog: thanks a lot! :)
[08:11:20] <godog>	 sure np
[08:16:41] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] data-engineering: ignore 'status' label pint check [alerts] - 10https://gerrit.wikimedia.org/r/906020 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi)
[08:21:31] <wikibugs>	 (03PS2) 10Elukey: profile::kafka::broker: refactor TLS settings [puppet] - 10https://gerrit.wikimedia.org/r/905954
[08:23:10] <wikibugs>	 (03PS4) 10David Caro: kubernetes: set NO_HOME for bulidservice and unset workingDir [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/901129
[08:23:12] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40552/console" [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey)
[08:23:19] <wikibugs>	 (03CR) 10David Caro: kubernetes: set NO_HOME for bulidservice and unset workingDir (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/901129 (owner: 10David Caro)
[08:23:27] <wikibugs>	 (03CR) 10David Caro: kubernetes: set NO_HOME for bulidservice and unset workingDir (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/901129 (owner: 10David Caro)
[08:24:30] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] "LGTM thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/906078 (https://phabricator.wikimedia.org/T334122) (owner: 10Eevans)
[08:27:07] <claime>	 hashar: Is there anything in that release that could explain a very low opcache hit ratio in your opinion?
[08:27:30] <claime>	 It may just be that it needs to rebuild, but we're starting to warn heavy
[08:28:30] <hashar>	 claime: opcache? the php bytecodes one?
[08:28:37] <claime>	 hashar: yeah
[08:28:39] <logmsgbot>	 !log jelto@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye
[08:29:11] <claime>	 https://grafana.wikimedia.org/goto/DmpVatYVz?orgId=1
[08:29:32] <hashar>	 08:09:01 Finished php-fpm-restarts (duration: 02m 37s)
[08:29:35] <hashar>	 that is all I know :]
[08:30:15] <claime>	 We'll wait and see if it goes back up
[08:30:53] <hashar>	 I don't know anything about the caches anymore, I have long forgotten or lost track of all the changes that happened on that front
[08:31:49] <hashar>	 maybe it is typical for a Thursday deploy as we get so much high traffic / lots of different code paths being newly loaded
[08:34:11] <hashar>	 over 9 days I see a similar fall for the scap wikiversions last Thursday https://grafana.wikimedia.org/d/GuHySj3mz/mediawiki-application-php?orgId=1&from=now-9d&to=now&viewPanel=33
[08:35:13] <hashar>	 maybe because we never invalidate opcache keys until php-fpm is restarted by scap
[08:35:24] <claime>	 Probably yeah
[08:35:57] <hashar>	 and ideally one day someone will figure out why the opcache gets corrupted or what kind of race condition we suffer from :D
[08:36:03] <claime>	 I'll keep an eye on it
[08:36:38] <hashar>	 if the response times from the app server backends stay similar, I think it is all fine
[08:37:25] <claime>	 Number of affected appservers is going down
[08:37:56] <claime>	 So I guess it just takes some time to replenish opcache after the restart
[08:39:05] <claime>	 latency increased a bit but that's kind of expected
[08:39:28] <claime>	 (and is going down anyways)
[08:39:40] <wikibugs>	 (03PS1) 10Filippo Giunchedi: sre: mute puppet-ca pint checks for missing series [alerts] - 10https://gerrit.wikimedia.org/r/906533 (https://phabricator.wikimedia.org/T309182)
[08:39:57] <claime>	 actually it went up a bit for parsoid, but not for appservers
[08:40:16] <elukey>	 !log powercycle ml-serve2004 - host frozen, racadm getsel shows multi-bit errors in various DIMM slots
[08:40:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:11] <icinga-wm_>	 RECOVERY - Host ml-serve2004 is UP: PING OK - Packet loss = 0%, RTA = 2.00 ms
[08:43:41] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:46:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:46:38] <jinxer-wm>	 (KubernetesCalicoDown) resolved: ml-serve2004.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:47:32] <wikibugs>	 (03PS3) 10Elukey: profile::kafka::broker: refactor TLS settings [puppet] - 10https://gerrit.wikimedia.org/r/905954
[08:51:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:52:03] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] Revert "mediawiki::scap: force creation of the symlink when enabled" [puppet] - 10https://gerrit.wikimedia.org/r/905983 (https://phabricator.wikimedia.org/T329857) (owner: 10Ahmon Dancy)
[08:52:15] <wikibugs>	 (03PS4) 10Elukey: profile::kafka::broker: refactor TLS settings [puppet] - 10https://gerrit.wikimedia.org/r/905954
[08:53:17] <wikibugs>	 (03CR) 10Elukey: "Folks the PCC output is consistent for all nodes, but it varies for kafka logging since we already removed the pki migration config after " [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey)
[08:56:49] <wikibugs>	 (03CR) 10JMeybohm: rest-gateway: add helmfile, enable mobileapps (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895327 (https://phabricator.wikimedia.org/T329074) (owner: 10Hnowlan)
[08:58:56] <claime>	 jouncebot: nowandnext
[08:58:56] <jouncebot>	 For the next 1 hour(s) and 1 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T0800)
[08:58:56] <jouncebot>	 In 1 hour(s) and 1 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T1000)
[08:58:57] <jouncebot>	 In 1 hour(s) and 1 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T1000)
[09:00:10] <claime>	 hashar: You're done with the train right? I can deploy a config change?
[09:08:20] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Untested but LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey)
[09:09:33] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cgoubert@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904463 (https://phabricator.wikimedia.org/T333528) (owner: 10Clément Goubert)
[09:09:41] <wikibugs>	 (03PS4) 10Clément Goubert: jobrunners: Raise memory_limit to match parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904463 (https://phabricator.wikimedia.org/T333528)
[09:09:51] <claime>	 Mpf, rebase >_>
[09:11:08] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by cgoubert@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904463 (https://phabricator.wikimedia.org/T333528) (owner: 10Clément Goubert)
[09:12:01] <wikibugs>	 (03Merged) 10jenkins-bot: jobrunners: Raise memory_limit to match parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/904463 (https://phabricator.wikimedia.org/T333528) (owner: 10Clément Goubert)
[09:12:16] <logmsgbot>	 !log cgoubert@deploy2002 Started scap: Backport for [[gerrit:904463|jobrunners: Raise memory_limit to match parsoid (T333528)]]
[09:12:20] <stashbot>	 T333528: Increase memory_limit for jobrunners to $wmgMemoryLimitParsoid - https://phabricator.wikimedia.org/T333528
[09:13:30] <jinxer-wm>	 (Emergency syslog message) firing: (2) Alert for device ssw1-e1-eqiad.mgmt.eqiad.wmnet - Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[09:13:37] <logmsgbot>	 !log cgoubert@deploy2002 cgoubert: Backport for [[gerrit:904463|jobrunners: Raise memory_limit to match parsoid (T333528)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[09:15:27] <volans>	 topranks: FYI ^^^ ssw1-e1-eqiad.mgmt.eqiad.wmnet
[09:18:30] <jinxer-wm>	 (Emergency syslog message) resolved: (2) Device ssw1-e1-eqiad.mgmt.eqiad.wmnet recovered from Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[09:19:27] <logmsgbot>	 !log cgoubert@deploy2002 Finished scap: Backport for [[gerrit:904463|jobrunners: Raise memory_limit to match parsoid (T333528)]] (duration: 07m 11s)
[09:19:32] <stashbot>	 T333528: Increase memory_limit for jobrunners to $wmgMemoryLimitParsoid - https://phabricator.wikimedia.org/T333528
[09:21:21] <wikibugs>	 (03PS2) 10Cathal Mooney: Add EVPN protocol config for enabled L3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/906529 (https://phabricator.wikimedia.org/T327934)
[09:21:38] <topranks>	 volans: ack, thanks
[09:21:57] <topranks>	 just added to monitoring, perhaps should have left alarms off but good test :P
[09:22:16] <logmsgbot>	 !log jelto@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host gitlab2003.wikimedia.org with OS bullseye
[09:23:58] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "Neat, I got a bit carried out with the refactor, so let me know if you prefer merging this, and then doing the refactor (that I can do if " [puppet] - 10https://gerrit.wikimedia.org/r/905243 (owner: 10Raymond Ndibe)
[09:24:27] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10Volans) I don't manage that spreadsheet, so I have no idea :) If that doesn't work we can easily switch to do the match on the Serial number column, that seems hardcoded for...
[09:25:21] <volans>	 no prob :D
[09:26:57] <wikibugs>	 (03PS3) 10Cathal Mooney: Add EVPN protocol config for enabled L3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/906529 (https://phabricator.wikimedia.org/T327934)
[09:30:33] <elukey>	 !log kafka main codfw cluster migrated to PKI TLS certs for brokers - T319372
[09:30:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:38] <stashbot>	 T319372: Move Kafka main to the new intermediate PKI CA - https://phabricator.wikimedia.org/T319372
[09:31:51] <wikibugs>	 10SRE, 10serviceops: Move Kafka main to the new intermediate PKI CA - https://phabricator.wikimedia.org/T319372 (10elukey) Last steps:  * clean up certs in puppet private * verify if any change is needed in deployment-prep
[09:31:59] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Traffic: Abstract LVS restart using cookbook - https://phabricator.wikimedia.org/T334166 (10Volans) FYI there is already a [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/loadbalancer/restart-pybal....
[09:37:44] <wikibugs>	 (03PS9) 10MVernon: sre.swift.remove-ghost-objects: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253)
[09:38:07] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-main-codfw cluster: Roll restart of jvm daemons.
[09:38:47] <logmsgbot>	 !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[09:39:32] <wikibugs>	 (03PS1) 10Filippo Giunchedi: aptrepo: go with Grafana 9 only [puppet] - 10https://gerrit.wikimedia.org/r/906537 (https://phabricator.wikimedia.org/T317887)
[09:39:45] <logmsgbot>	 !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[09:39:50] <wikibugs>	 (03CR) 10MVernon: "Hi," [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon)
[09:41:14] <wikibugs>	 (03PS1) 10Elukey: kafka: remove setting to avoid checking the hostname in TLS certs [software/spicerack] - 10https://gerrit.wikimedia.org/r/906538
[09:42:41] <logmsgbot>	 !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[09:42:45] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hieradata: rename aux-k8s prometheus [puppet] - 10https://gerrit.wikimedia.org/r/906539 (https://phabricator.wikimedia.org/T334192)
[09:43:27] <logmsgbot>	 !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[09:44:13] <wikibugs>	 (03PS1) 10Cathal Mooney: Add ssw1-e1-eqiad and ssw1-f1-eqiad to homer [homer/public] - 10https://gerrit.wikimedia.org/r/906540 (https://phabricator.wikimedia.org/T322937)
[09:45:27] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "With my limited understanding, this lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey)
[09:46:38] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:46:56] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Add upstream release 1.15.7 [debs/istioctl] - 10https://gerrit.wikimedia.org/r/905959 (https://phabricator.wikimedia.org/T334068) (owner: 10Elukey)
[09:48:13] <wikibugs>	 (03CR) 10David Caro: smart_data_dump: adapt for newer ssacli (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/904747 (https://phabricator.wikimedia.org/T306354) (owner: 10David Caro)
[09:48:36] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] istio: upgrade to upstream version 1.15.7 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/905956 (https://phabricator.wikimedia.org/T334068) (owner: 10Elukey)
[09:48:46] <wikibugs>	 (03CR) 10Muehlenhoff: aptrepo: go with Grafana 9 only (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/906537 (https://phabricator.wikimedia.org/T317887) (owner: 10Filippo Giunchedi)
[09:51:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:55:59] <wikibugs>	 (03CR) 10Hnowlan: rest-gateway: add helmfile, enable mobileapps (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895327 (https://phabricator.wikimedia.org/T329074) (owner: 10Hnowlan)
[09:56:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[09:56:30] <wikibugs>	 (03CR) 10Filippo Giunchedi: aptrepo: go with Grafana 9 only (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/906537 (https://phabricator.wikimedia.org/T317887) (owner: 10Filippo Giunchedi)
[09:56:32] <wikibugs>	 (03PS2) 10Filippo Giunchedi: aptrepo: go with Grafana 9 only [puppet] - 10https://gerrit.wikimedia.org/r/906537 (https://phabricator.wikimedia.org/T317887)
[09:56:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[09:56:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T333332)', diff saved to https://phabricator.wikimedia.org/P46083 and previous config saved to /var/cache/conftool/dbconfig/20230406-095640-ladsgroup.json
[09:56:44] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[09:58:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T333332)', diff saved to https://phabricator.wikimedia.org/P46084 and previous config saved to /var/cache/conftool/dbconfig/20230406-095800-ladsgroup.json
[10:00:05] <jouncebot>	 mvolz: It is that lovely time of the day again! You are hereby commanded to deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T1000).
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T1000)
[10:10:37] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] rest-gateway: add helmfile, enable mobileapps (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/895327 (https://phabricator.wikimedia.org/T329074) (owner: 10Hnowlan)
[10:10:40] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/906537 (https://phabricator.wikimedia.org/T317887) (owner: 10Filippo Giunchedi)
[10:13:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P46085 and previous config saved to /var/cache/conftool/dbconfig/20230406-101306-ladsgroup.json
[10:13:26] <wikibugs>	 (03PS1) 10Elukey: admin_ng: bump max quota for ml-serve namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/906541
[10:13:36] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirtlocal1003.mgmt.eqiad.wmnet with reboot policy FORCED
[10:14:27] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] aptrepo: go with Grafana 9 only [puppet] - 10https://gerrit.wikimedia.org/r/906537 (https://phabricator.wikimedia.org/T317887) (owner: 10Filippo Giunchedi)
[10:15:43] <wikibugs>	 (03PS5) 10Hnowlan: rest-gateway: add helmfile, enable mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/895327 (https://phabricator.wikimedia.org/T329074)
[10:16:29] <wikibugs>	 (03CR) 10Kevin Bazira: [C: 03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/906541 (owner: 10Elukey)
[10:22:35] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] admin_ng: bump max quota for ml-serve namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/906541 (owner: 10Elukey)
[10:23:50] <icinga-wm_>	 PROBLEM - Host blog.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[10:25:08] <icinga-wm_>	 RECOVERY - Host blog.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 23.80 ms
[10:25:59] <wikibugs>	 (03CR) 10Muehlenhoff: Add an in place Debian upgrade script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway)
[10:26:07] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[10:26:13] <wikibugs>	 (03CR) 10Volans: "Nice! I've left few minor nits/possible improvement, none of them is a blocker. The rest LGTM." [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon)
[10:26:53] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Lovely!!!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/906538 (owner: 10Elukey)
[10:27:06] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[10:27:13] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] kafka: remove setting to avoid checking the hostname in TLS certs [software/spicerack] - 10https://gerrit.wikimedia.org/r/906538 (owner: 10Elukey)
[10:27:47] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[10:28:13] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[10:28:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P46086 and previous config saved to /var/cache/conftool/dbconfig/20230406-102813-ladsgroup.json
[10:29:38] <wikibugs>	 (03CR) 10Volans: "Thanks for the replies, I don't want to be a blocker." [puppet] - 10https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway)
[10:30:53] <wikibugs>	 (03Merged) 10jenkins-bot: kafka: remove setting to avoid checking the hostname in TLS certs [software/spicerack] - 10https://gerrit.wikimedia.org/r/906538 (owner: 10Elukey)
[10:36:22] <icinga-wm_>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:36:34] <icinga-wm_>	 PROBLEM - BFD status on cr3-knams is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:36:44] <icinga-wm_>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:37:28] * volans checking calendar
[10:38:09] <wikibugs>	 (03PS9) 10Hnowlan: api-gateway: add REST gateway Lua CSP handler [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321)
[10:38:11] <wikibugs>	 (03PS6) 10Hnowlan: rest-gateway: add helmfile, enable mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/895327 (https://phabricator.wikimedia.org/T329074)
[10:39:02] <wikibugs>	 (03CR) 10Hnowlan: api-gateway: add REST gateway Lua CSP handler (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan)
[10:39:40] <wikibugs>	 (03CR) 10Muehlenhoff: "After some more investigation I think I know the issue: cloudvirt1019/cloudvirt1020 are unicorns since they are the only two remaining two" [puppet] - 10https://gerrit.wikimedia.org/r/904747 (https://phabricator.wikimedia.org/T306354) (owner: 10David Caro)
[10:39:50] <icinga-wm_>	 RECOVERY - BFD status on cr3-knams is OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:40:45] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudvirtlocal1003.mgmt.eqiad.wmnet with reboot policy FORCED
[10:41:05] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED
[10:41:14] <volans>	 mmmh there is a maintenance but seems a different cable
[10:41:25] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED
[10:42:06] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudvirtlocal10[01-03] - https://phabricator.wikimedia.org/T329863 (10Jclark-ctr)
[10:42:56] <icinga-wm_>	 RECOVERY - BFD status on cr2-eqdfw is OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:43:18] <icinga-wm_>	 RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:43:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T333332)', diff saved to https://phabricator.wikimedia.org/P46087 and previous config saved to /var/cache/conftool/dbconfig/20230406-104319-ladsgroup.json
[10:43:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[10:43:24] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[10:43:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[10:43:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[10:44:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[10:44:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[10:44:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[10:44:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T333332)', diff saved to https://phabricator.wikimedia.org/P46088 and previous config saved to /var/cache/conftool/dbconfig/20230406-104435-ladsgroup.json
[10:46:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T333332)', diff saved to https://phabricator.wikimedia.org/P46089 and previous config saved to /var/cache/conftool/dbconfig/20230406-104644-ladsgroup.json
[10:50:47] <wikibugs>	 (03PS10) 10MVernon: sre.swift.remove-ghost-objects: new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253)
[10:53:00] <icinga-wm_>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:53:06] <icinga-wm_>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:53:38] <wikibugs>	 (03CR) 10MVernon: "Thanks again!" [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon)
[10:54:40] <icinga-wm_>	 RECOVERY - BFD status on cr2-eqdfw is OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:54:46] <icinga-wm_>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:58:07] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] rest-gateway: add helmfile, enable mobileapps [deployment-charts] - 10https://gerrit.wikimedia.org/r/895327 (https://phabricator.wikimedia.org/T329074) (owner: 10Hnowlan)
[11:01:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P46090 and previous config saved to /var/cache/conftool/dbconfig/20230406-110151-ladsgroup.json
[11:03:06] <icinga-wm_>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:03:28] <icinga-wm_>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:09:58] <icinga-wm_>	 PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 40 probes of 780 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:11:16] <topranks>	 ^^ problem on transit cct from codfw to eqdfw, not sure it should cause the atlas alert though
[11:15:08] <icinga-wm_>	 RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:15:44] <icinga-wm_>	 RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 25 probes of 780 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:16:26] <icinga-wm_>	 RECOVERY - BFD status on cr2-eqdfw is OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:16:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P46091 and previous config saved to /var/cache/conftool/dbconfig/20230406-111657-ladsgroup.json
[11:21:28] <icinga-wm_>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:21:48] <icinga-wm_>	 PROBLEM - BFD status on cr3-knams is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:23:57] <wikibugs>	 (03PS1) 10Muehlenhoff: smart: Disable smart-dump for servers with hpsa [puppet] - 10https://gerrit.wikimedia.org/r/906554 (https://phabricator.wikimedia.org/T247997)
[11:26:36] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/906554 (https://phabricator.wikimedia.org/T247997) (owner: 10Muehlenhoff)
[11:27:45] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] "I think the istio ingress module might get you in trouble here, at least the staging part of it. It was developed with wikikube only in mi" [deployment-charts] - 10https://gerrit.wikimedia.org/r/904777 (https://phabricator.wikimedia.org/T330414) (owner: 10Ilias Sarantopoulos)
[11:28:10] <icinga-wm_>	 RECOVERY - BFD status on cr2-eqdfw is OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:28:28] <icinga-wm_>	 RECOVERY - BFD status on cr3-knams is OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:29:04] <topranks>	 ^^ this was very odd, different provider, Dallas area being common though.
[11:29:27] <topranks>	 what was strange is that IPv6 was working, and OSPF was up, but BFD was down and IPv4 wasn't working
[11:29:33] <topranks>	 has come back now 
[11:29:59] <topranks>	 I'd be slightly worried there is a bad secondary path we got flipped to due to some wan re-routing.
[11:31:22] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM cookbook/python wise. As for the swift logic I'll leave it to the swift experts." [cookbooks] - 10https://gerrit.wikimedia.org/r/905595 (https://phabricator.wikimedia.org/T327253) (owner: 10MVernon)
[11:32:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T333332)', diff saved to https://phabricator.wikimedia.org/P46092 and previous config saved to /var/cache/conftool/dbconfig/20230406-113203-ladsgroup.json
[11:32:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[11:32:08] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[11:32:20] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[11:32:23] <volans>	 topranks: there was a planned work in the calendar, not sure if it can be related
[11:32:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T333332)', diff saved to https://phabricator.wikimedia.org/P46093 and previous config saved to /var/cache/conftool/dbconfig/20230406-113226-ladsgroup.json
[11:34:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T333332)', diff saved to https://phabricator.wikimedia.org/P46094 and previous config saved to /var/cache/conftool/dbconfig/20230406-113436-ladsgroup.json
[11:39:34] <topranks>	 volans: thanks, yeah I saw that, shouldn't be related to any of these based on the info
[11:41:04] <topranks>	 things have been stable for past ~10mins anyway, ripe atlas probes are back at same success level as previous 
[11:41:14] <wikibugs>	 (03PS1) 10Muehlenhoff: Add krb2002 as additional KDC [puppet] - 10https://gerrit.wikimedia.org/r/906560 (https://phabricator.wikimedia.org/T331695)
[11:41:14] <topranks>	 I'll continue to keep an eye on things
[11:41:43] <volans>	 ack, thx
[11:41:48] <volans>	 lmk if we (oncall) can help
[11:49:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P46095 and previous config saved to /var/cache/conftool/dbconfig/20230406-114942-ladsgroup.json
[11:49:45] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/906554 (https://phabricator.wikimedia.org/T247997) (owner: 10Muehlenhoff)
[11:58:00] <wikibugs>	 (03PS1) 10Muehlenhoff: Create a separate Hiera variable of KDCs specifically for use in client config [puppet] - 10https://gerrit.wikimedia.org/r/906563 (https://phabricator.wikimedia.org/T331695)
[12:03:42] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Add ssw1-e1-eqiad and ssw1-f1-eqiad to homer [homer/public] - 10https://gerrit.wikimedia.org/r/906540 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney)
[12:04:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P46096 and previous config saved to /var/cache/conftool/dbconfig/20230406-120448-ladsgroup.json
[12:08:37] <wikibugs>	 (03CR) 10DCausse: "\o/" [software/spicerack] - 10https://gerrit.wikimedia.org/r/906538 (owner: 10Elukey)
[12:09:18] <wikibugs>	 (03CR) 10Ayounsi: Add EVPN protocol config for enabled L3 switches (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/906529 (https://phabricator.wikimedia.org/T327934) (owner: 10Cathal Mooney)
[12:11:03] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/906563 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff)
[12:13:41] <wikibugs>	 (03PS1) 10Muehlenhoff: zuul-merger: Make auto restart dependent on whether service is enabled or not [puppet] - 10https://gerrit.wikimedia.org/r/906564
[12:14:32] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/906564 (owner: 10Muehlenhoff)
[12:14:47] <wikibugs>	 (03PS1) 10Jelto: install_server: hard code raid sizes for gitlab partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/906565 (https://phabricator.wikimedia.org/T330172)
[12:16:30] <wikibugs>	 (03CR) 10Jelto: "as discussed in IRC, moving from relative to absolute raid sizes for GitLab" [puppet] - 10https://gerrit.wikimedia.org/r/906565 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto)
[12:19:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T333332)', diff saved to https://phabricator.wikimedia.org/P46097 and previous config saved to /var/cache/conftool/dbconfig/20230406-121955-ladsgroup.json
[12:19:57] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[12:20:00] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[12:20:12] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[12:20:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1173 (T333332)', diff saved to https://phabricator.wikimedia.org/P46098 and previous config saved to /var/cache/conftool/dbconfig/20230406-122018-ladsgroup.json
[12:21:52] <wikibugs>	 (03PS2) 10Muehlenhoff: zuul-merger: Make auto restart dependent on whether service is enabled or not [puppet] - 10https://gerrit.wikimedia.org/r/906564
[12:22:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T333332)', diff saved to https://phabricator.wikimedia.org/P46099 and previous config saved to /var/cache/conftool/dbconfig/20230406-122229-ladsgroup.json
[12:23:07] <wikibugs>	 (03CR) 10Muehlenhoff: "No idea why PCC is marked as failing, the result seems all fine to me." [puppet] - 10https://gerrit.wikimedia.org/r/906563 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff)
[12:23:22] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/906564 (owner: 10Muehlenhoff)
[12:25:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[12:25:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Let's give it a shot" [puppet] - 10https://gerrit.wikimedia.org/r/906565 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto)
[12:25:10] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "Ah nice. Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/906564 (owner: 10Muehlenhoff)
[12:26:18] <dcausse>	 !log restarting blazegraph on wdqs1012 (BlazegraphFreeAllocatorsDecreasingRapidly)
[12:26:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:50] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] zuul-merger: Make auto restart dependent on whether service is enabled or not [puppet] - 10https://gerrit.wikimedia.org/r/906564 (owner: 10Muehlenhoff)
[12:29:51] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] install_server: hard code raid sizes for gitlab partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/906565 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto)
[12:32:29] <wikibugs>	 (03CR) 10Ayounsi: "Some initial comments." [puppet] - 10https://gerrit.wikimedia.org/r/906103 (https://phabricator.wikimedia.org/T333007) (owner: 10Cathal Mooney)
[12:33:36] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[12:35:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[12:35:07] <wikibugs>	 (03PS4) 10Cathal Mooney: Add EVPN protocol config for enabled L3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/906529 (https://phabricator.wikimedia.org/T327934)
[12:37:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P46100 and previous config saved to /var/cache/conftool/dbconfig/20230406-123735-ladsgroup.json
[12:41:16] <logmsgbot>	 !log jelto@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye
[12:48:01] <wikibugs>	 (03Abandoned) 10David Caro: smart_data_dump: adapt for newer ssacli [puppet] - 10https://gerrit.wikimedia.org/r/904747 (https://phabricator.wikimedia.org/T306354) (owner: 10David Caro)
[12:49:23] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/906554 (https://phabricator.wikimedia.org/T247997) (owner: 10Muehlenhoff)
[12:50:43] <godog>	 !log import grafana 9.4 T317887
[12:50:46] <wikibugs>	 (03PS1) 10Muehlenhoff: zuul::merger: Fix up checks [puppet] - 10https://gerrit.wikimedia.org/r/906570
[12:50:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:49] <stashbot>	 T317887: Upgrade to Grafana 9 - https://phabricator.wikimedia.org/T317887
[12:51:10] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] zuul::merger: Fix up checks [puppet] - 10https://gerrit.wikimedia.org/r/906570 (owner: 10Muehlenhoff)
[12:52:14] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Create a separate Hiera variable of KDCs specifically for use in client config [puppet] - 10https://gerrit.wikimedia.org/r/906563 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff)
[12:52:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P46101 and previous config saved to /var/cache/conftool/dbconfig/20230406-125242-ladsgroup.json
[12:52:49] <wikibugs>	 (03PS2) 10Muehlenhoff: zuul::merger: Fix up checks [puppet] - 10https://gerrit.wikimedia.org/r/906570
[12:53:31] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Traffic: Abstract LVS restart using cookbook - https://phabricator.wikimedia.org/T334166 (10Clement_Goubert) FWIW, the cookbook can be used, but it needs to be given the actual lvs servers to run on. Assuming `lvs1020` and `lvs2010` are secondaries, `lvs1...
[12:55:13] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/906570 (owner: 10Muehlenhoff)
[12:59:31] <wikibugs>	 (03PS1) 10Elukey: Upgrade to upstream version 1.15.7 [debs/istio] - 10https://gerrit.wikimedia.org/r/906571 (https://phabricator.wikimedia.org/T334068)
[13:00:02] <wikibugs>	 (03CR) 10Elukey: "Already imported the pristine/upstream release and pushed to gerrit." [debs/istio] - 10https://gerrit.wikimedia.org/r/906571 (https://phabricator.wikimedia.org/T334068) (owner: 10Elukey)
[13:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T1300)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Dear deployers, time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T1300).
[13:00:05] <jouncebot>	 mazevedo: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:17] <mazevedo>	 hi!
[13:00:27] <Lucas_WMDE>	 I can deploy in 15-30 minutes if no one else is around
[13:00:38] <mazevedo>	 ok, thanks!
[13:03:03] <wikibugs>	 (03PS3) 10Muehlenhoff: zuul::merger: Fix up checks [puppet] - 10https://gerrit.wikimedia.org/r/906570
[13:04:22] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Traffic: Abstract LVS restart using cookbook - https://phabricator.wikimedia.org/T334166 (10Volans) Thanks for the clarification @Clement_Goubert
[13:04:51] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/906570 (owner: 10Muehlenhoff)
[13:05:40] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate cxserver to mw-api-int - https://phabricator.wikimedia.org/T334204 (10Clement_Goubert)
[13:07:44] <wikibugs>	 (03PS3) 10Clément Goubert: cxserver: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/903646 (https://phabricator.wikimedia.org/T334060)
[13:07:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T333332)', diff saved to https://phabricator.wikimedia.org/P46102 and previous config saved to /var/cache/conftool/dbconfig/20230406-130749-ladsgroup.json
[13:07:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[13:07:54] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[13:08:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[13:08:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T333332)', diff saved to https://phabricator.wikimedia.org/P46103 and previous config saved to /var/cache/conftool/dbconfig/20230406-130812-ladsgroup.json
[13:08:49] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Clement_Goubert)
[13:08:57] <wikibugs>	 (03PS4) 10Muehlenhoff: zuul::merger: Fix up checks [puppet] - 10https://gerrit.wikimedia.org/r/906570
[13:09:01] <wikibugs>	 (03PS4) 10Clément Goubert: cxserver: Switch to mw-api-int-async on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/903646 (https://phabricator.wikimedia.org/T334204)
[13:09:34] <wikibugs>	 (03PS1) 10Filippo Giunchedi: data-engineering: fix varnishkafka metric names, deploy to all sites [alerts] - 10https://gerrit.wikimedia.org/r/906574 (https://phabricator.wikimedia.org/T309182)
[13:09:46] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/906570 (owner: 10Muehlenhoff)
[13:09:58] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Clement_Goubert)
[13:10:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T333332)', diff saved to https://phabricator.wikimedia.org/P46104 and previous config saved to /var/cache/conftool/dbconfig/20230406-131022-ladsgroup.json
[13:10:27] <wikibugs>	 (03CR) 10Elukey: "now that I think about it, this may affect deployment-prep's settings (and maybe pontoon ones). Should we set profile::kafka::broker:use_p" [puppet] - 10https://gerrit.wikimedia.org/r/905954 (owner: 10Elukey)
[13:10:53] <wikibugs>	 (03PS1) 10Atieno: blubber: Bump blubber version to v0.17.0 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/906575 (https://phabricator.wikimedia.org/T334205)
[13:11:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] data-engineering: fix varnishkafka metric names, deploy to all sites [alerts] - 10https://gerrit.wikimedia.org/r/906574 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi)
[13:15:45] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "Looks better indeed :)" [puppet] - 10https://gerrit.wikimedia.org/r/906570 (owner: 10Muehlenhoff)
[13:19:27] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] zuul::merger: Fix up checks [puppet] - 10https://gerrit.wikimedia.org/r/906570 (owner: 10Muehlenhoff)
[13:22:02] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] Add upstream release 1.15.7 [debs/istioctl] - 10https://gerrit.wikimedia.org/r/905959 (https://phabricator.wikimedia.org/T334068) (owner: 10Elukey)
[13:22:10] <wikibugs>	 (03PS5) 10Cathal Mooney: Add EVPN protocol config for enabled L3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/906529 (https://phabricator.wikimedia.org/T327934)
[13:22:30] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) firing: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[13:22:42] * volans here
[13:22:57] <godog>	 here too, checking
[13:23:01] <volans>	 acked
[13:23:06] <godog>	 thank you!
[13:23:30] <volans>	 btw the link to alerts doesn't work
[13:23:47] <XioNoX>	 https://librenms.wikimedia.org/bill/bill_id=28/
[13:23:51] <XioNoX>	 hello analytics
[13:24:02] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] hieradata: rename aux-k8s prometheus [puppet] - 10https://gerrit.wikimedia.org/r/906539 (https://phabricator.wikimedia.org/T334192) (owner: 10Filippo Giunchedi)
[13:24:04] <wikibugs>	 (03CR) 10Cathal Mooney: Add EVPN protocol config for enabled L3 switches (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/906529 (https://phabricator.wikimedia.org/T327934) (owner: 10Cathal Mooney)
[13:24:21] <elukey>	 XioNoX: I was about to ask :)
[13:24:22] <volans>	 XioNoX: do you already have a hostname?
[13:24:51] <XioNoX>	 volans: many I'd guess
[13:25:06] <wikibugs>	 (03CR) 10Kamila Součková: "Just kamila doing her best "confused kamila" impression" [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan)
[13:25:11] <godog>	 a bunch of an-presto looks like, https://librenms.wikimedia.org/device/160/ports
[13:25:24] <XioNoX>	 analytics1* 
[13:25:27] <XioNoX>	 https://librenms.wikimedia.org/device/device=160/tab=port/port=14063/ for example
[13:25:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P46106 and previous config saved to /var/cache/conftool/dbconfig/20230406-132528-ladsgroup.json
[13:25:34] <XioNoX>	 they all contribute
[13:25:48] <XioNoX>	 probably worth checking who ran a large job
[13:25:48] <elukey>	 https://yarn.wikimedia.org/cluster/app/application_1678266962370_166166 started 41 mins ago, does it match?
[13:25:51] <godog>	 indeed
[13:25:52] <elukey>	 (more or less)
[13:26:09] <volans>	 checking
[13:26:23] <XioNoX>	 seems close enough
[13:26:29] <elukey>	 seems allocating a ton of resources
[13:26:39] <claime>	 elukey: didn't it start 41 hours ago?
[13:26:48] <claime>	 Elapsed:  41hrs, 17mins, 11sec 
[13:26:51] <elukey>	 claime: yes you are right, sorry
[13:26:53] <godog>	 I see traffic rising starting at 12 utc
[13:27:06] <elukey>	 but it may have started to use resources only recently
[13:27:23] <claime>	 elukey: fair enough, I was just sanity checking myself x)
[13:27:30] <jinxer-wm>	 (Primary outbound port utilisation over 80%  #page) resolved: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[13:27:39] <volans>	 between 12 and 12:05 the start yes
[13:27:40] <elukey>	 claime: thanks a lot for double checking :)
[13:27:52] <wikibugs>	 (03CR) 10Hokwelum: [C: 03+1] "looks good! Thank you :-)" [puppet] - 10https://gerrit.wikimedia.org/r/902738 (owner: 10Meno25)
[13:28:43] <elukey>	 it is probably that job, it is using a ton of executors https://yarn.wikimedia.org/proxy/application_1678266962370_166166/executors/
[13:29:13] <volans>	 memory used, disk used, cores... and not network used :D
[13:29:29] <volans>	 *and no
[13:29:50] <XioNoX>	 is it possible to know who ran it?
[13:30:00] <elukey>	 yes the username is aitolkyn
[13:30:18] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] Create a separate Hiera variable of KDCs specifically for use in client config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/906563 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff)
[13:30:24] * Lucas_WMDE here
[13:30:27] <elukey>	 probably a collaborator, I see a gmail email list
[13:30:31] <elukey>	 *listed
[13:30:38] <wikibugs>	 (03PS10) 10Hnowlan: api-gateway: add REST gateway Lua CSP handler [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321)
[13:31:23] <volans>	 the traffic did not go back to normal values yet fwiw
[13:31:34] <volans>	 although the page recovered
[13:31:51] <Lucas_WMDE>	 ok, then I’ll hold off on the deployment (fyi mazevedo)
[13:32:11] <wikibugs>	 (03PS3) 10Daniel Kinzler: Make VE on officewiki use Parsoid directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896104 (https://phabricator.wikimedia.org/T320529)
[13:32:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Make VE on officewiki use Parsoid directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896104 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler)
[13:32:26] <elukey>	 we can kill it easily if anything happens
[13:32:29] <elukey>	 Cc: steve_munene: --^
[13:32:37] <godog>	 volans: FWIW the link I think works, though the page already recovered a minute later from https://librenms.wikimedia.org/device/device=160/tab=logs/section=eventlog/
[13:32:45] <mazevedo>	 Lucas_WMDE ack
[13:33:15] <volans>	 godog: but we got the recovery 5m after
[13:33:43] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Upgrade to upstream version 1.15.7 [debs/istio] - 10https://gerrit.wikimedia.org/r/906571 (https://phabricator.wikimedia.org/T334068) (owner: 10Elukey)
[13:34:01] <XioNoX>	 is it possible to do something else about it than kill it?
[13:34:05] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] blubber: Bump blubber version to v0.17.0 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/906575 (https://phabricator.wikimedia.org/T334205) (owner: 10Atieno)
[13:34:12] <wikibugs>	 (03PS1) 10Andrew Bogott: cloud-vps: only backup toolforge things every other day [puppet] - 10https://gerrit.wikimedia.org/r/906579
[13:34:23] <elukey>	 XioNoX: not that I know
[13:34:43] <logmsgbot>	 !log jelto@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host gitlab2003.wikimedia.org with OS bullseye
[13:34:46] <volans>	 do we have any kind of rate-limiter that could be applied?
[13:34:57] <godog>	 volans: yeah I'm not sure why yet re: recovery not being immediate
[13:35:34] <wikibugs>	 (03CR) 10Hnowlan: api-gateway: add REST gateway Lua CSP handler (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan)
[13:36:35] <wikibugs>	 (03CR) 10Ayounsi: "A few comments inline. Overall lgtm." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney)
[13:39:16] <icinga-wm_>	 RECOVERY - Check systemd state on contint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:39:37] <volans>	 the traffic has gone a bit down
[13:40:33] <logmsgbot>	 !log jelto@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye
[13:40:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P46108 and previous config saved to /var/cache/conftool/dbconfig/20230406-134035-ladsgroup.json
[13:41:47] <wikibugs>	 (03PS1) 10Ssingh: hiera: lvs/balancer: unify hiera post bullseye upgrade [puppet] - 10https://gerrit.wikimedia.org/r/906580 (https://phabricator.wikimedia.org/T321309)
[13:42:22] <wikibugs>	 (03PS2) 10Ssingh: hiera: lvs/balancer: unify hiera post bullseye upgrade [puppet] - 10https://gerrit.wikimedia.org/r/906580 (https://phabricator.wikimedia.org/T321309)
[13:44:13] <wikibugs>	 (03CR) 10Cathal Mooney: "Thanks for the review, yeah makes sense for it to be a file.  As for the extention change that'll require more work as it'll mean changing" [puppet] - 10https://gerrit.wikimedia.org/r/906103 (https://phabricator.wikimedia.org/T333007) (owner: 10Cathal Mooney)
[13:44:44] <icinga-wm_>	 PROBLEM - HTTPS on clouddumps1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed https://wikitech.wikimedia.org/wiki/Dumps/XML-SQL_Dumps%23A_labstore_host_dies_%28web_or_nfs_server_for_dumps%29
[13:45:23] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "Nice!" [homer/public] - 10https://gerrit.wikimedia.org/r/906529 (https://phabricator.wikimedia.org/T327934) (owner: 10Cathal Mooney)
[13:46:24] <icinga-wm_>	 RECOVERY - HTTPS on clouddumps1001 is OK: SSL OK - Certificate dumps.wikimedia.org valid until 2023-06-05 08:49:55 +0000 (expires in 59 days) https://wikitech.wikimedia.org/wiki/Dumps/XML-SQL_Dumps%23A_labstore_host_dies_%28web_or_nfs_server_for_dumps%29
[13:50:29] <wikibugs>	 (03PS1) 10Filippo Giunchedi: Make deploy-tag compulsory [alerts] - 10https://gerrit.wikimedia.org/r/906581 (https://phabricator.wikimedia.org/T309182)
[13:51:23] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts lvs6003.drmrs.wmnet
[13:55:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T333332)', diff saved to https://phabricator.wikimedia.org/P46109 and previous config saved to /var/cache/conftool/dbconfig/20230406-135541-ladsgroup.json
[13:55:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[13:55:45] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[13:55:58] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[13:56:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T333332)', diff saved to https://phabricator.wikimedia.org/P46110 and previous config saved to /var/cache/conftool/dbconfig/20230406-135604-ladsgroup.json
[13:56:28] <wikibugs>	 (03PS1) 10Ssingh: hiera: lvs/balancer: unify hiera post bullseye upgrade (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/906583 (https://phabricator.wikimedia.org/T321309)
[13:56:48] <urandom>	 !log rebooting sessionstore1001 — T327954
[13:56:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:52] <stashbot>	 T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954
[13:57:40] <Lucas_WMDE>	 I lost track of the channel for a bit – would it be okay to deploy a config change now?
[13:58:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T333332)', diff saved to https://phabricator.wikimedia.org/P46111 and previous config saved to /var/cache/conftool/dbconfig/20230406-135813-ladsgroup.json
[13:58:26] <claime>	 Lucas_WMDE: traffic's back to normal, I think you can go ahead
[13:58:26] <wikibugs>	 (03PS2) 10Muehlenhoff: Create a separate Hiera variable of KDCs specifically for use in client config [puppet] - 10https://gerrit.wikimedia.org/r/906563 (https://phabricator.wikimedia.org/T331695)
[13:58:30] <Lucas_WMDE>	 ok thanks
[13:58:52] <Lucas_WMDE>	 mazevedo: if you’re still around I can deploy the config change now
[13:59:01] <Lucas_WMDE>	 (the backports window is almost over but there’s nothing immediately after it)
[14:00:47] <mazevedo>	 hey
[14:00:52] <mazevedo>	 still here!
[14:00:54] <Lucas_WMDE>	 ok!
[14:01:23] <wikibugs>	 (03PS4) 10Lucas Werkmeister (WMDE): Add session schema config for mobile apps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905555 (https://phabricator.wikimedia.org/T331481) (owner: 10Mazevedo)
[14:01:31] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905555 (https://phabricator.wikimedia.org/T331481) (owner: 10Mazevedo)
[14:01:43] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts lvs6003.drmrs.wmnet
[14:02:16] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] Upgrade to upstream version 1.15.7 [debs/istio] - 10https://gerrit.wikimedia.org/r/906571 (https://phabricator.wikimedia.org/T334068) (owner: 10Elukey)
[14:02:22] <wikibugs>	 (03Merged) 10jenkins-bot: Add session schema config for mobile apps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/905555 (https://phabricator.wikimedia.org/T331481) (owner: 10Mazevedo)
[14:02:28] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] istio: upgrade to upstream version 1.15.7 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/905956 (https://phabricator.wikimedia.org/T334068) (owner: 10Elukey)
[14:02:34] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:905555|Add session schema config for mobile apps (T331481)]]
[14:02:38] <stashbot>	 T331481: Generalize Android MEP session schema for iOS to use - https://phabricator.wikimedia.org/T331481
[14:03:57] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 mazevedo and lucaswerkmeister-wmde: Backport for [[gerrit:905555|Add session schema config for mobile apps (T331481)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[14:04:08] <Lucas_WMDE>	 mazevedo: can you test the change on mwdebug?
[14:04:20] <mazevedo>	 Lucas_WMDE on it
[14:05:04] <mazevedo>	 Lucas_WMDE looking good, thanks!
[14:05:10] <Lucas_WMDE>	 ok thanks!
[14:08:32] <logmsgbot>	 !log fab@deploy2002 Started deploy [airflow-dags/research@2192f15]: (no justification provided)
[14:08:43] <logmsgbot>	 !log fab@deploy2002 Finished deploy [airflow-dags/research@2192f15]: (no justification provided) (duration: 00m 11s)
[14:10:28] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:905555|Add session schema config for mobile apps (T331481)]] (duration: 07m 54s)
[14:10:33] <stashbot>	 T331481: Generalize Android MEP session schema for iOS to use - https://phabricator.wikimedia.org/T331481
[14:11:13] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] Create a separate Hiera variable of KDCs specifically for use in client config [puppet] - 10https://gerrit.wikimedia.org/r/906563 (https://phabricator.wikimedia.org/T331695) (owner: 10Muehlenhoff)
[14:12:05] <sukhe>	 BGP alerts in drmrs expected
[14:12:17] <Lucas_WMDE>	 mazevedo: should be done now, thanks for your patience :)
[14:12:33] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs6003.drmrs.wmnet with OS bullseye
[14:12:44] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs6003.drmrs.wmnet with OS bullseye
[14:13:13] <wikibugs>	 (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: increase jvm-overhead.max [deployment-charts] - 10https://gerrit.wikimedia.org/r/906040 (owner: 10DCausse)
[14:13:19] <wikibugs>	 10SRE, 10Machine-Learning-Team, 10serviceops: Import and deploy istio 1.15.7 - https://phabricator.wikimedia.org/T334068 (10elukey) ` root@apt2001:/srv/wikimedia# reprepro lsbycomponent istio-cni istio-cni |  1.9.5-1 | bullseye-wikimedia | component/istio195 | amd64 istio-cni | 1.15.7-1 | bullseye-wikimedia...
[14:13:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P46112 and previous config saved to /var/cache/conftool/dbconfig/20230406-141319-ladsgroup.json
[14:14:02] <elukey>	 !log upload new istio-cni and istioctl 1.15.7 debian package versions to bullseye-wikimedia - T334068
[14:14:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:06] <stashbot>	 T334068: Import and deploy istio 1.15.7 - https://phabricator.wikimedia.org/T334068
[14:15:10] <icinga-wm_>	 PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:15:17] <sukhe>	 ^ expected
[14:16:36] <wikibugs>	 (03PS1) 10Elukey: custom_deploy.d: upgrade istio to 1.15.7-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/906590 (https://phabricator.wikimedia.org/T334068)
[14:16:59] <wikibugs>	 10SRE, 10Machine-Learning-Team, 10serviceops, 10Patch-For-Review: Import and deploy istio 1.15.7 - https://phabricator.wikimedia.org/T334068 (10elukey)
[14:17:43] <wikibugs>	 (03PS2) 10Ssingh: hiera: lvs/balancer: unify hiera post bullseye upgrade (eqsin) [puppet] - 10https://gerrit.wikimedia.org/r/906583 (https://phabricator.wikimedia.org/T321309)
[14:17:59] <wikibugs>	 (03Merged) 10jenkins-bot: rdf-streaming-updater: increase jvm-overhead.max [deployment-charts] - 10https://gerrit.wikimedia.org/r/906040 (owner: 10DCausse)
[14:19:02] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] custom_deploy.d: upgrade istio to 1.15.7-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/906590 (https://phabricator.wikimedia.org/T334068) (owner: 10Elukey)
[14:20:44] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+1 C: 03+2] service: move device-analytics to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/899607 (https://phabricator.wikimedia.org/T320967) (owner: 10Hnowlan)
[14:21:16] <elukey>	 !log upgrade istioctl on deploy[12]002 and istio-cni on ml-serve[12]00[1-8] manually - T334068
[14:21:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:20] <stashbot>	 T334068: Import and deploy istio 1.15.7 - https://phabricator.wikimedia.org/T334068
[14:21:28] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] custom_deploy.d: upgrade istio to 1.15.7-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/906590 (https://phabricator.wikimedia.org/T334068) (owner: 10Elukey)
[14:21:47] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:21:54] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:22:19] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: provision k8s-aux volume [puppet] - 10https://gerrit.wikimedia.org/r/906591 (https://phabricator.wikimedia.org/T334192)
[14:23:43] <wikibugs>	 (03PS1) 10Ssingh: hiera: lvs6003: update iface names for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/906592 (https://phabricator.wikimedia.org/T321309)
[14:24:45] <Amir1>	 we are having somewhat of an outage atm
[14:24:49] <Amir1>	 with s1 in eqiad
[14:25:31] <Amir1>	        Slave_SQL_Running_State: Waiting for semi-sync ACK from slave
[14:25:34] <Amir1>	 COME ON
[14:25:52] <icinga-wm_>	 PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 73 connections established with conf2005.codfw.wmnet:4001 (min=74) https://wikitech.wikimedia.org/wiki/PyBal
[14:26:09] <Amir1>	 marostegui: can I disable semi-sync on eqiad master of s1? All of the s1 is lagged for a minute
[14:26:25] <volans>	 Amir1: need any help from oncallers?
[14:26:48] <Amir1>	 checking error rate would be amazing
[14:26:59] <Amir1>	 restarted replication
[14:27:13] <Amir1>	 now at Slave_SQL_Running_State: Reading event from the relay log
[14:27:16] <volans>	 ack
[14:27:23] <Amir1>	 let's see if it gets better
[14:27:34] <Amir1>	 back to Slave_SQL_Running_State: Waiting for semi-sync ACK from slave
[14:27:43] <icinga-wm_>	 PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.80:4972]) https://wikitech.wikimedia.org/wiki/PyBal
[14:27:48] <volans>	 did we failover s1 master recently?
[14:27:49] <icinga-wm_>	 PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.80:4972]) https://wikitech.wikimedia.org/wiki/PyBal
[14:27:59] <icinga-wm_>	 PROBLEM - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 91 connections established with conf2004.codfw.wmnet:4001 (min=92) https://wikitech.wikimedia.org/wiki/PyBal
[14:28:04] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: lvs6003: update iface names for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/906592 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[14:28:11] <icinga-wm_>	 PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 77 connections established with conf1007.eqiad.wmnet:4001 (min=78) https://wikitech.wikimedia.org/wiki/PyBal
[14:28:21] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+1] "LGTM except for the commented." [deployment-charts] - 10https://gerrit.wikimedia.org/r/890887 (https://phabricator.wikimedia.org/T326321) (owner: 10Hnowlan)
[14:28:26] <volans>	 hnowlan: are those because of you? ^^^ lvs alerts
[14:28:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P46113 and previous config saved to /var/cache/conftool/dbconfig/20230406-142826-ladsgroup.json
[14:28:38] <sukhe>	 probably
[14:28:45] <icinga-wm_>	 ACKNOWLEDGEMENT - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 77 connections established with conf1007.eqiad.wmnet:4001 (min=78) Hnowlan changing state of device-analytics. https://wikitech.wikimedia.org/wiki/PyBal
[14:28:45] <icinga-wm_>	 ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.80:4972]) Hnowlan changing state of device-analytics. https://wikitech.wikimedia.org/wiki/PyBal
[14:28:45] <icinga-wm_>	 ACKNOWLEDGEMENT - PyBal connections to etcd on lvs1020 is CRITICAL: CRITICAL: 123 connections established with conf1007.eqiad.wmnet:4001 (min=124) Hnowlan changing state of device-analytics. https://wikitech.wikimedia.org/wiki/PyBal
[14:28:45] <icinga-wm_>	 ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.80:4972]) Hnowlan changing state of device-analytics. https://wikitech.wikimedia.org/wiki/PyBal
[14:28:45] <icinga-wm_>	 ACKNOWLEDGEMENT - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 73 connections established with conf2005.codfw.wmnet:4001 (min=74) Hnowlan changing state of device-analytics. https://wikitech.wikimedia.org/wiki/PyBal
[14:28:45] <icinga-wm_>	 ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.80:4972]) Hnowlan changing state of device-analytics. https://wikitech.wikimedia.org/wiki/PyBal
[14:28:45] <icinga-wm_>	 ACKNOWLEDGEMENT - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 91 connections established with conf2004.codfw.wmnet:4001 (min=92) Hnowlan changing state of device-analytics. https://wikitech.wikimedia.org/wiki/PyBal
[14:28:47] <hnowlan>	 volans: just acked 
[14:28:48] <sukhe>	 seems to match the hosts :)
[14:28:51] <volans>	 ack thx
[14:28:59] <volans>	 maybe the cookbook could silence them...
[14:29:47] <icinga-wm_>	 PROBLEM - statsv Varnishkafka log producer on cp3064 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[14:30:00] <sukhe>	 that's new
[14:30:03] <icinga-wm_>	 PROBLEM - eventlogging Varnishkafka log producer on cp3064 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[14:30:55] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1020*,lvs2010*} and A:lvs (T320967)
[14:30:59] <stashbot>	 T320967: [AQS 2.0] New Service Request device_analytics - https://phabricator.wikimedia.org/T320967
[14:31:19] <icinga-wm_>	 RECOVERY - PyBal connections to etcd on lvs2010 is OK: OK: 92 connections established with conf2004.codfw.wmnet:4001 (min=92) https://wikitech.wikimedia.org/wiki/PyBal
[14:31:20] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Add EVPN protocol config for enabled L3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/906529 (https://phabricator.wikimedia.org/T327934) (owner: 10Cathal Mooney)
[14:31:21] <icinga-wm_>	 RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[14:31:24] <wikibugs>	 (03CR) 10Herron: [C: 03+1] Make deploy-tag compulsory [alerts] - 10https://gerrit.wikimedia.org/r/906581 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi)
[14:31:29] <sukhe>	 volans: I will look at the cp3064 one
[14:31:41] <volans>	 thx
[14:31:51] <wikibugs>	 (03Merged) 10jenkins-bot: Add EVPN protocol config for enabled L3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/906529 (https://phabricator.wikimedia.org/T327934) (owner: 10Cathal Mooney)
[14:31:53] <icinga-wm_>	 RECOVERY - statsv Varnishkafka log producer on cp3064 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[14:32:00] <sukhe>	 cool didn't even look :)
[14:32:06] <sukhe>	 but still, something must be wrong so
[14:32:11] <icinga-wm_>	 RECOVERY - eventlogging Varnishkafka log producer on cp3064 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[14:32:14] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1020*,lvs2010*} and A:lvs (T320967)
[14:32:57] <mazevedo>	 Lucas_WMDE ty :)
[14:33:57] <logmsgbot>	 !log jelto@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host gitlab2003.wikimedia.org with OS bullseye
[14:34:06] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs6003.drmrs.wmnet with reason: host reimage
[14:34:52] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] prometheus: provision k8s-aux volume [puppet] - 10https://gerrit.wikimedia.org/r/906591 (https://phabricator.wikimedia.org/T334192) (owner: 10Filippo Giunchedi)
[14:35:30] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: provision k8s-aux volume [puppet] - 10https://gerrit.wikimedia.org/r/906591 (https://phabricator.wikimedia.org/T334192) (owner: 10Filippo Giunchedi)
[14:37:18] <wikibugs>	 (03PS1) 10Ladsgroup: Disable DT backend on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906593
[14:37:24] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs6003.drmrs.wmnet with reason: host reimage
[14:37:56] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1019*,lvs2009*} and A:lvs (T320967)
[14:38:00] <stashbot>	 T320967: [AQS 2.0] New Service Request device_analytics - https://phabricator.wikimedia.org/T320967
[14:38:19] <icinga-wm_>	 RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 74 connections established with conf2005.codfw.wmnet:4001 (min=74) https://wikitech.wikimedia.org/wiki/PyBal
[14:38:21] <icinga-wm_>	 RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[14:38:31] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Disable DT backend on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906593 (owner: 10Ladsgroup)
[14:38:33] <wikibugs>	 (03PS5) 10EoghanGaffney: Cookbook for switchover of Gitlab to a new host [cookbooks] - 10https://gerrit.wikimedia.org/r/894634 (https://phabricator.wikimedia.org/T330771)
[14:38:50] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906593 (owner: 10Ladsgroup)
[14:39:03] <icinga-wm_>	 RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 78 connections established with conf1007.eqiad.wmnet:4001 (min=78) https://wikitech.wikimedia.org/wiki/PyBal
[14:39:14] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1019*,lvs2009*} and A:lvs (T320967)
[14:39:16] <wikibugs>	 (03Merged) 10jenkins-bot: Disable DT backend on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906593 (owner: 10Ladsgroup)
[14:39:21] <wikibugs>	 (03PS2) 10Cathal Mooney: Add ssw1-e1-eqiad and ssw1-f1-eqiad to homer [homer/public] - 10https://gerrit.wikimedia.org/r/906540 (https://phabricator.wikimedia.org/T322937)
[14:39:27] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:906593|Disable DT backend on enwiki]]
[14:40:43] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Sync data for new ssw1 spine switches in eqiad. - cmooney@cumin1001 - T322937"
[14:40:47] <stashbot>	 T322937: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937
[14:40:59] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:906593|Disable DT backend on enwiki]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[14:41:01] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Add ssw1-e1-eqiad and ssw1-f1-eqiad to homer [homer/public] - 10https://gerrit.wikimedia.org/r/906540 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney)
[14:41:36] <wikibugs>	 (03Merged) 10jenkins-bot: Add ssw1-e1-eqiad and ssw1-f1-eqiad to homer [homer/public] - 10https://gerrit.wikimedia.org/r/906540 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney)
[14:42:39] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Sync data for new ssw1 spine switches in eqiad. - cmooney@cumin1001 - T322937"
[14:43:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T333332)', diff saved to https://phabricator.wikimedia.org/P46114 and previous config saved to /var/cache/conftool/dbconfig/20230406-144332-ladsgroup.json
[14:43:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[14:43:37] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[14:43:49] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[14:44:01] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] admin: update platform engineering approvers [puppet] - 10https://gerrit.wikimedia.org/r/889967 (https://phabricator.wikimedia.org/T300244) (owner: 10Hnowlan)
[14:44:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2114.codfw.wmnet with reason: Maintenance
[14:44:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2114.codfw.wmnet with reason: Maintenance
[14:44:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2114 (T333332)', diff saved to https://phabricator.wikimedia.org/P46115 and previous config saved to /var/cache/conftool/dbconfig/20230406-144437-ladsgroup.json
[14:46:39] <wikibugs>	 (03PS1) 10Jelto: install_server: fix line break in gitlab parman recipe [puppet] - 10https://gerrit.wikimedia.org/r/906596 (https://phabricator.wikimedia.org/T330172)
[14:46:41] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:906593|Disable DT backend on enwiki]] (duration: 07m 14s)
[14:47:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T333332)', diff saved to https://phabricator.wikimedia.org/P46116 and previous config saved to /var/cache/conftool/dbconfig/20230406-144753-ladsgroup.json
[14:48:43] <wikibugs>	 (03CR) 10Jelto: "I compared this to the other recipes and it seems the end of recipe should not continue with a . \ (only a .) but the different partitions" [puppet] - 10https://gerrit.wikimedia.org/r/906596 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto)
[14:52:06] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for trizek - https://phabricator.wikimedia.org/T333863 (10Elitre) Of course, approved.
[14:52:58] <wikibugs>	 (03CR) 10Muehlenhoff: install_server: fix line break in gitlab parman recipe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/906596 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto)
[14:56:17] <wikibugs>	 (03PS1) 10Ssingh: hiera: lvs6003: update interface name [puppet] - 10https://gerrit.wikimedia.org/r/906598 (https://phabricator.wikimedia.org/T321309)
[14:57:01] <logmsgbot>	 !log jgiannelos@deploy2002 Started deploy [restbase/deploy@8fb20e9]: (no justification provided)
[14:57:02] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: lvs6003: update interface name [puppet] - 10https://gerrit.wikimedia.org/r/906598 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[14:57:21] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs6003.drmrs.wmnet with OS bullseye
[14:57:31] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs6003.drmrs.wmnet with OS bullseye executed with errors: - lvs6003 (**FAIL**)   - Downtimed on...
[14:57:34] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs6003.drmrs.wmnet with OS bullseye
[14:57:44] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs6003.drmrs.wmnet with OS bullseye
[14:58:22] <wikibugs>	 (03CR) 10Jelto: install_server: fix line break in gitlab parman recipe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/906596 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto)
[15:02:55] <wikibugs>	 10SRE, 10Traffic: varnish-frontend-fetcherr: Assert error in vslc_vtx_next, 100% CPU usage - https://phabricator.wikimedia.org/T253093 (10ssingh) 05Resolved→03Open ` Apr 06 14:27:14 cp3064 varnishkafka[1513247]:   Condition(c->offset <= c->vtx->len) not true. Apr 06 14:27:14 cp3064 systemd[1]: varnishkafka...
[15:03:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P46117 and previous config saved to /var/cache/conftool/dbconfig/20230406-150300-ladsgroup.json
[15:04:33] <wikibugs>	 (03PS1) 10BCornwall: sre/systemd: Remove query params from dashboard [alerts] - 10https://gerrit.wikimedia.org/r/906599
[15:07:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Thank you !" [alerts] - 10https://gerrit.wikimedia.org/r/906599 (owner: 10BCornwall)
[15:09:05] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) >>! In T292095#8715082, @Jclark-ctr wrote: > @cmooney Racks e5-7 f5-7 have been cabled and racked  do you want to use same ticket f...
[15:11:04] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] sre/systemd: Remove query params from dashboard [alerts] - 10https://gerrit.wikimedia.org/r/906599 (owner: 10BCornwall)
[15:16:38] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs6003.drmrs.wmnet with reason: host reimage
[15:17:25] <wikibugs>	 (03PS1) 10Ladsgroup: Disable writes on group2 for DT backend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906600
[15:18:02] <logmsgbot>	 !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@8fb20e9]: (no justification provided) (duration: 21m 01s)
[15:18:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P46118 and previous config saved to /var/cache/conftool/dbconfig/20230406-151806-ladsgroup.json
[15:18:46] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Disable writes on group2 for DT backend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906600 (owner: 10Ladsgroup)
[15:19:30] <wikibugs>	 (03Merged) 10jenkins-bot: Disable writes on group2 for DT backend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/906600 (owner: 10Ladsgroup)
[15:19:37] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs6003.drmrs.wmnet with reason: host reimage
[15:20:00] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:906600|Disable writes on group2 for DT backend]]
[15:20:31] <logmsgbot>	 !log fab@deploy2002 Started deploy [airflow-dags/research@2192f15]: (no justification provided)
[15:20:42] <logmsgbot>	 !log fab@deploy2002 Finished deploy [airflow-dags/research@2192f15]: (no justification provided) (duration: 00m 11s)
[15:21:19] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:906600|Disable writes on group2 for DT backend]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[15:21:49] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] install_server: fix line break in gitlab parman recipe (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/906596 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto)
[15:28:12] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:906600|Disable writes on group2 for DT backend]] (duration: 08m 11s)
[15:29:18] <wikibugs>	 10SRE, 10Machine-Learning-Team, 10serviceops: Import and deploy istio 1.15.7 - https://phabricator.wikimedia.org/T334068 (10elukey)
[15:29:25] <wikibugs>	 (03PS1) 10Ssingh: admin: add trizek to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/906601 (https://phabricator.wikimedia.org/T333863)
[15:33:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T333332)', diff saved to https://phabricator.wikimedia.org/P46119 and previous config saved to /var/cache/conftool/dbconfig/20230406-153312-ladsgroup.json
[15:33:15] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2117.codfw.wmnet with reason: Maintenance
[15:33:17] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[15:33:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2117.codfw.wmnet with reason: Maintenance
[15:33:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T333332)', diff saved to https://phabricator.wikimedia.org/P46120 and previous config saved to /var/cache/conftool/dbconfig/20230406-153335-ladsgroup.json
[15:35:49] <icinga-wm_>	 RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:36:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T333332)', diff saved to https://phabricator.wikimedia.org/P46121 and previous config saved to /var/cache/conftool/dbconfig/20230406-153602-ladsgroup.json
[15:41:10] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Cabling for Eqiad racke E5-7 and F5-7 - https://phabricator.wikimedia.org/T334231 (10cmooney)
[15:42:10] <wikibugs>	 (03PS1) 10Ssingh: hiera: update lvs6003 interfaces in common/interfaces.yaml [puppet] - 10https://gerrit.wikimedia.org/r/906604 (https://phabricator.wikimedia.org/T321309)
[15:42:47] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs6003.drmrs.wmnet with OS bullseye
[15:42:57] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs6003.drmrs.wmnet with OS bullseye completed: - lvs6003 (**WARN**)   - Downtimed on Icinga/Aler...
[15:43:23] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] install_server: fix line break in gitlab parman recipe [puppet] - 10https://gerrit.wikimedia.org/r/906596 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto)
[15:44:10] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: update lvs6003 interfaces in common/interfaces.yaml [puppet] - 10https://gerrit.wikimedia.org/r/906604 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[15:51:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P46122 and previous config saved to /var/cache/conftool/dbconfig/20230406-155108-ladsgroup.json
[15:51:10] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/906601 (https://phabricator.wikimedia.org/T333863) (owner: 10Ssingh)
[15:53:49] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs6003.drmrs.wmnet with OS bullseye
[15:53:59] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs6003.drmrs.wmnet with OS bullseye
[15:54:02] <wikibugs>	 (03CR) 10EoghanGaffney: "I've taken care of, I think, all of the comments below (except one TODO about a DNS check, that will come later). Follow-up review would b" [cookbooks] - 10https://gerrit.wikimedia.org/r/894634 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney)
[15:57:05] <icinga-wm_>	 PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:59:45] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] admin: add trizek to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/906601 (https://phabricator.wikimedia.org/T333863) (owner: 10Ssingh)
[16:00:05] <jouncebot>	 jbond and rzl: OwO what's this, a deployment window?? Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T1600). nyaa~
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:23] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for trizek - https://phabricator.wikimedia.org/T333863 (10ssingh)
[16:02:06] <topranks>	 ^^^ BGP alert on asw1-b12-drmrs relates to reimage of lvs6003 su.khe has kicked off
[16:02:23] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to analytics-privatedata-users for trizek - https://phabricator.wikimedia.org/T333863 (10ssingh) 05Open→03Resolved a:03ssingh @Trizek-WMF: Your access request has been merged. Please try logging in in about 30 minutes and feel free re-open...
[16:05:12] <topranks>	 !log Enable BGP EVPN sessions between eqiad row e/f Leaf and Spine devices
[16:05:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:33] <logmsgbot>	 !log jelto@cumin2002 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye
[16:06:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P46123 and previous config saved to /var/cache/conftool/dbconfig/20230406-160614-ladsgroup.json
[16:09:27] <wikibugs>	 (03PS1) 10Cathal Mooney: Puppet additions for ssw1-e1-eqiad and ssw1-f1-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/906627 (https://phabricator.wikimedia.org/T322937)
[16:12:50] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs6003.drmrs.wmnet with reason: host reimage
[16:15:48] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs6003.drmrs.wmnet with reason: host reimage
[16:21:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T333332)', diff saved to https://phabricator.wikimedia.org/P46124 and previous config saved to /var/cache/conftool/dbconfig/20230406-162120-ladsgroup.json
[16:21:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2124.codfw.wmnet with reason: Maintenance
[16:21:25] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[16:21:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2124.codfw.wmnet with reason: Maintenance
[16:21:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T333332)', diff saved to https://phabricator.wikimedia.org/P46125 and previous config saved to /var/cache/conftool/dbconfig/20230406-162144-ladsgroup.json
[16:24:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T333332)', diff saved to https://phabricator.wikimedia.org/P46126 and previous config saved to /var/cache/conftool/dbconfig/20230406-162409-ladsgroup.json
[16:31:33] <icinga-wm_>	 RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:34:44] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs6003.drmrs.wmnet with OS bullseye
[16:34:56] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs6003.drmrs.wmnet with OS bullseye completed: - lvs6003 (**WARN**)   - Downtimed on Icinga/Aler...
[16:39:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P46127 and previous config saved to /var/cache/conftool/dbconfig/20230406-163916-ladsgroup.json
[16:41:17] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.remove-downtime for lvs6003.drmrs.wmnet
[16:41:18] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs6003.drmrs.wmnet
[16:54:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P46128 and previous config saved to /var/cache/conftool/dbconfig/20230406-165422-ladsgroup.json
[16:58:53] <wikibugs>	 (03CR) 10Volans: "Much nicer! I found some smaller things that still need fixing, but we should be closed." [cookbooks] - 10https://gerrit.wikimedia.org/r/904510 (owner: 10Slyngshede)
[16:58:59] <logmsgbot>	 !log jelto@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host gitlab2003.wikimedia.org with OS bullseye
[17:00:04] <jouncebot>	 bd808: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T1700).
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T1700)
[17:02:40] <wikibugs>	 (03CR) 10FNegri: [C: 03+1] wikireplica_dns.yaml: move toolsdb DNS to new server in 'tools' project [puppet] - 10https://gerrit.wikimedia.org/r/906053 (https://phabricator.wikimedia.org/T333471) (owner: 10Andrew Bogott)
[17:02:57] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] wikireplica_dns.yaml: move toolsdb DNS to new server in 'tools' project [puppet] - 10https://gerrit.wikimedia.org/r/906053 (https://phabricator.wikimedia.org/T333471) (owner: 10Andrew Bogott)
[17:05:23] <logmsgbot>	 !log aqu@deploy2002 Started deploy [airflow-dags/analytics@318480e]: Fix for dump_month_of_daily_pageviews dag - Analytics [airflow-dags@318480e]
[17:05:38] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@318480e]: Fix for dump_month_of_daily_pageviews dag - Analytics [airflow-dags@318480e] (duration: 00m 14s)
[17:09:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T333332)', diff saved to https://phabricator.wikimedia.org/P46129 and previous config saved to /var/cache/conftool/dbconfig/20230406-170928-ladsgroup.json
[17:09:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2141.codfw.wmnet with reason: Maintenance
[17:09:33] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[17:09:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2141.codfw.wmnet with reason: Maintenance
[17:10:07] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2151.codfw.wmnet with reason: Maintenance
[17:10:22] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2151.codfw.wmnet with reason: Maintenance
[17:10:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2151 (T333332)', diff saved to https://phabricator.wikimedia.org/P46130 and previous config saved to /var/cache/conftool/dbconfig/20230406-171028-ladsgroup.json
[17:12:45] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts lvs3007.esams.wmnet
[17:12:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T333332)', diff saved to https://phabricator.wikimedia.org/P46131 and previous config saved to /var/cache/conftool/dbconfig/20230406-171254-ladsgroup.json
[17:14:55] <wikibugs>	 (03CR) 10Volans: "thanks for all the fixes, just found couple of minor issues, LGTM otherwise" [cookbooks] - 10https://gerrit.wikimedia.org/r/894634 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney)
[17:15:00] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10wiki_willy) Would it be possible to use both serial number and/or asset tag for the match?  I'll follow up with Julianne (she's currently out) regarding the formula being us...
[17:16:30] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10Volans) But if the asset tag comes from netbox it will not match anything for future hosts... as the host will not be anymore in Netbox :)
[17:19:11] <wikibugs>	 (03PS1) 10Volans: reports: accounting use serial to match recyled [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/906635 (https://phabricator.wikimedia.org/T320955)
[17:19:48] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10Volans) @wiki_willy I've sent the above patch to match on serial instead of asset tag. LMK what do you want to do.
[17:19:52] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (10ssingh)
[17:22:52] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts lvs3007.esams.wmnet
[17:24:19] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (10ssingh) (Person on clinic duty here): I initially removed the `SRE-Access-Requests` tag, my apologies, because I thought that this...
[17:28:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P46132 and previous config saved to /var/cache/conftool/dbconfig/20230406-172800-ladsgroup.json
[17:31:55] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on lsw1-e1-eqiad.mgmt with reason: test on ssw1-e1-eqiad will take ospf on lsw1-e1-eqiad down.
[17:32:22] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lsw1-e1-eqiad.mgmt with reason: test on ssw1-e1-eqiad will take ospf on lsw1-e1-eqiad down.
[17:32:32] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e7d20917-1f70-4c85-bea4-4fae89694441) set by cmooney@cumin1001 f...
[17:32:37] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on lsw1-f1-eqiad.mgmt with reason: test on ssw1-e1-eqiad will take ospf on lsw1-f1-eqiad down.
[17:32:53] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lsw1-f1-eqiad.mgmt with reason: test on ssw1-e1-eqiad will take ospf on lsw1-f1-eqiad down.
[17:33:01] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=09fdc8d3-92d3-4c3b-8e46-8c1befa6a846) set by cmooney@cumin1001 f...
[17:33:07] <wikibugs>	 (03CR) 10Volans: [C: 03+2] reports: accounting use serial to match recyled [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/906635 (https://phabricator.wikimedia.org/T320955) (owner: 10Volans)
[17:33:35] <wikibugs>	 (03PS1) 10Ssingh: hiera: lvs3007: update iface names for bullseye (esams) [puppet] - 10https://gerrit.wikimedia.org/r/906636 (https://phabricator.wikimedia.org/T321309)
[17:33:59] <wikibugs>	 (03Merged) 10jenkins-bot: reports: accounting use serial to match recyled [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/906635 (https://phabricator.wikimedia.org/T320955) (owner: 10Volans)
[17:34:01] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "@Antoine What do you think, can I just merge it and we see when we get to it? pretty sure we won't want "wmflabs" in there and just need t" [puppet] - 10https://gerrit.wikimedia.org/r/888808 (https://phabricator.wikimedia.org/T329444) (owner: 10Dzahn)
[17:34:08] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (10Dwisehaupt) @ssingh Thanks for looking at this. This task was created to help capture the output of some ongoing discussions to fi...
[17:34:11] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.netbox.update-extras rolling update on A:netbox
[17:34:18] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling update on A:netbox
[17:35:06] <wikibugs>	 (03PS6) 10EoghanGaffney: Cookbook for switchover of Gitlab to a new host [cookbooks] - 10https://gerrit.wikimedia.org/r/894634 (https://phabricator.wikimedia.org/T330771)
[17:35:18] <wikibugs>	 (03CR) 10EoghanGaffney: Cookbook for switchover of Gitlab to a new host (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/894634 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney)
[17:35:21] <sukhe>	 BGP alerts in esams expected
[17:36:17] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs3007.esams.wmnet with OS bullseye
[17:36:30] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs3007.esams.wmnet with OS bullseye
[17:37:05] <icinga-wm_>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:37:12] <sukhe>	 ^ expected
[17:38:01] <wikibugs>	 (03PS11) 10David Caro: maintain-dbusers: use click for cli definition [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955)
[17:38:03] <wikibugs>	 (03PS1) 10David Caro: maintain_dbusers: move all the files under service [puppet] - 10https://gerrit.wikimedia.org/r/906637
[17:39:10] <wikibugs>	 (03CR) 10David Caro: "The one adding prometheus will follow this one, wanted to split the click change and the prometheus one" [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro)
[17:39:19] <icinga-wm_>	 PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:39:47] <wikibugs>	 (03CR) 10BPirkle: "Is this really the phab task you intended to link?" [puppet] - 10https://gerrit.wikimedia.org/r/906554 (https://phabricator.wikimedia.org/T247997) (owner: 10Muehlenhoff)
[17:39:48] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (10Dzahn) my 2 cents:  Yes, it's possible to resolve this with a new admin group that gets sudo privs to run a particular set of cook...
[17:40:50] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] maintain_dbusers: move all the files under service [puppet] - 10https://gerrit.wikimedia.org/r/906637 (owner: 10David Caro)
[17:41:14] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] maintain-dbusers: use click for cli definition [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro)
[17:43:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P46133 and previous config saved to /var/cache/conftool/dbconfig/20230406-174306-ladsgroup.json
[17:45:46] <wikibugs>	 10SRE, 10SRE-Access-Requests: Add MarcoAurelio to #mediawiki_security - https://phabricator.wikimedia.org/T333870 (10Dzahn) Currently the process to sign a new NDA is under way. Once that is confirmed on T333884 it would be a good time to also resolve this ticket.
[17:46:56] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-ItamarWMDE: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (10Dzahn) a:05Ladsgroup→03ItamarWMDE
[17:46:59] <wikibugs>	 (03PS1) 10Volans: reports: accounting convert serial to string [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/906638 (https://phabricator.wikimedia.org/T320955)
[17:47:12] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-ItamarWMDE: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (10Dzahn) 05Open→03In progress
[17:47:37] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10API Platform (Sprint 07): Requesting access to analytics-privatedata-users for atieno - https://phabricator.wikimedia.org/T333550 (10Dzahn) 05Open→03In progress
[17:47:53] <wikibugs>	 (03CR) 10Volans: [C: 03+2] reports: accounting convert serial to string [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/906638 (https://phabricator.wikimedia.org/T320955) (owner: 10Volans)
[17:48:41] <icinga-wm_>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:48:42] <wikibugs>	 (03Merged) 10jenkins-bot: reports: accounting convert serial to string [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/906638 (https://phabricator.wikimedia.org/T320955) (owner: 10Volans)
[17:49:52] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.netbox.update-extras rolling update on A:netbox
[17:49:59] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling update on A:netbox
[17:51:13] <wikibugs>	 (03CR) 10Ssingh: "Merging before Puppet kicks in :)" [puppet] - 10https://gerrit.wikimedia.org/r/906636 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[17:51:15] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: lvs3007: update iface names for bullseye (esams) [puppet] - 10https://gerrit.wikimedia.org/r/906636 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[17:55:54] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10wiki_willy) 05Open→03Resolved Thanks @Volans.  It looks like we're all set now.  https://netbox.wikimedia.org/extras/reports/results/4443574/
[17:56:09] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM for the cookbook/python stuff, I'll leave it to the gitlab experts for the logic" [cookbooks] - 10https://gerrit.wikimedia.org/r/894634 (https://phabricator.wikimedia.org/T330771) (owner: 10EoghanGaffney)
[17:58:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T333332)', diff saved to https://phabricator.wikimedia.org/P46134 and previous config saved to /var/cache/conftool/dbconfig/20230406-175813-ladsgroup.json
[17:58:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2158.codfw.wmnet with reason: Maintenance
[17:58:18] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[17:58:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2158.codfw.wmnet with reason: Maintenance
[17:58:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance
[17:58:48] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance
[17:58:51] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs3007.esams.wmnet with reason: host reimage
[17:58:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T333332)', diff saved to https://phabricator.wikimedia.org/P46135 and previous config saved to /var/cache/conftool/dbconfig/20230406-175854-ladsgroup.json
[18:00:05] <jouncebot>	 hashar and dduvall: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T1800).
[18:01:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T333332)', diff saved to https://phabricator.wikimedia.org/P46136 and previous config saved to /var/cache/conftool/dbconfig/20230406-180119-ladsgroup.json
[18:02:05] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs3007.esams.wmnet with reason: host reimage
[18:04:55] <wikibugs>	 (03PS3) 10Majavah: Remove osmdb records [dns] - 10https://gerrit.wikimedia.org/r/892901 (https://phabricator.wikimedia.org/T323159)
[18:06:49] <wikibugs>	 (03PS2) 10Majavah: openstack: remove osmdb dns records [puppet] - 10https://gerrit.wikimedia.org/r/892903 (https://phabricator.wikimedia.org/T323159)
[18:06:51] <wikibugs>	 (03PS2) 10Majavah: P:wmcs: remove osmdb classes [puppet] - 10https://gerrit.wikimedia.org/r/892904 (https://phabricator.wikimedia.org/T323159)
[18:06:53] <wikibugs>	 (03PS2) 10Majavah: osm: remove unuseud shapefile_import class [puppet] - 10https://gerrit.wikimedia.org/r/892905
[18:07:29] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)
[18:16:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P46137 and previous config saved to /var/cache/conftool/dbconfig/20230406-181625-ladsgroup.json
[18:17:53] <wikibugs>	 (03PS1) 10BCornwall: hiera: lvs/interfaces: update 6001 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906640 (https://phabricator.wikimedia.org/T321309)
[18:17:55] <icinga-wm_>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:18:01] <icinga-wm_>	 RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:18:54] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs3007.esams.wmnet with OS bullseye
[18:19:03] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs3007.esams.wmnet with OS bullseye completed: - lvs3007 (**PASS**)   - Downtimed on Icinga/Aler...
[18:20:49] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh)
[18:23:49] <sukhe>	 please hold off making any netbox changes for now
[18:26:33] <mutante>	 ok
[18:31:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P46138 and previous config saved to /var/cache/conftool/dbconfig/20230406-183132-ladsgroup.json
[18:33:39] <wikibugs>	 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10FNavas-foundation) Thanks @Dzahn. I understood it was  /fnavas-wmf\ not just /fnavas\  Neither of those two or /fnavas-foundation\ allow me to log-in on the wikitech ma...
[18:37:30] <wikibugs>	 (03CR) 10Stevemunene: [C: 03+2] modules::profile::manifests::airflow.pp: add plugins_folder path [puppet] - 10https://gerrit.wikimedia.org/r/904609 (https://phabricator.wikimedia.org/T324485) (owner: 10Mforns)
[18:38:38] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[18:41:06] <wikibugs>	 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Dzahn) Hi!  So, each LDAP user has multiple fields, uid, sn and cn and depending on whether it's an SSH login, a wiki login or other, confusingly a different one may b...
[18:43:10] <wikibugs>	 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Dzahn) P.S. I did test if wikitech wiki sends out email to myself, and it did. that's why I am saying to check on the ITS side.
[18:46:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T333332)', diff saved to https://phabricator.wikimedia.org/P46139 and previous config saved to /var/cache/conftool/dbconfig/20230406-184638-ladsgroup.json
[18:46:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance
[18:46:43] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[18:46:55] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance
[18:47:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T333332)', diff saved to https://phabricator.wikimedia.org/P46140 and previous config saved to /var/cache/conftool/dbconfig/20230406-184701-ladsgroup.json
[18:49:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T333332)', diff saved to https://phabricator.wikimedia.org/P46141 and previous config saved to /var/cache/conftool/dbconfig/20230406-184929-ladsgroup.json
[18:54:07] <wikibugs>	 (03PS2) 10Ssingh: hiera: lvs/interfaces: update 6001 iface name [puppet] - 10https://gerrit.wikimedia.org/r/906640 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall)
[18:54:24] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] "Looks good but let's hold till we figure out the Netbox restore, just in case!" [puppet] - 10https://gerrit.wikimedia.org/r/906640 (https://phabricator.wikimedia.org/T321309) (owner: 10BCornwall)
[19:04:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P46142 and previous config saved to /var/cache/conftool/dbconfig/20230406-190435-ladsgroup.json
[19:14:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:19:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:19:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P46143 and previous config saved to /var/cache/conftool/dbconfig/20230406-191941-ladsgroup.json
[19:26:34] <logmsgbot>	 !log mforns@deploy2002 Started deploy [airflow-dags/analytics@b454afd]: (no justification provided)
[19:26:45] <logmsgbot>	 !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@b454afd]: (no justification provided) (duration: 00m 11s)
[19:34:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T333332)', diff saved to https://phabricator.wikimedia.org/P46144 and previous config saved to /var/cache/conftool/dbconfig/20230406-193447-ladsgroup.json
[19:34:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance
[19:34:53] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[19:35:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2171.codfw.wmnet with reason: Maintenance
[19:35:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T333332)', diff saved to https://phabricator.wikimedia.org/P46145 and previous config saved to /var/cache/conftool/dbconfig/20230406-193510-ladsgroup.json
[19:37:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T333332)', diff saved to https://phabricator.wikimedia.org/P46146 and previous config saved to /var/cache/conftool/dbconfig/20230406-193737-ladsgroup.json
[19:45:27] <icinga-wm_>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[19:45:38] <wikibugs>	 (03PS2) 10Aaron Schulz: Set "s3" as the default section name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893834
[19:45:41] <wikibugs>	 (03PS3) 10Aaron Schulz: Set "s3" as the default section name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893834
[19:52:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P46147 and previous config saved to /var/cache/conftool/dbconfig/20230406-195243-ladsgroup.json
[19:59:39] <wikibugs>	 (03PS2) 10Eevans: swift: add ms-fe101[3-4] as new Swift proxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/906078 (https://phabricator.wikimedia.org/T334122)
[20:00:05] <jouncebot>	 brennen and TheresNoTime: May I have your attention please! UTC late backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230406T2000)
[20:00:19] <thcipriani>	 nary a patch to be found
[20:00:42] <TheresNoTime>	 :D
[20:01:46] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] swift: add ms-fe101[3-4] as new Swift proxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/906078 (https://phabricator.wikimedia.org/T334122) (owner: 10Eevans)
[20:03:00] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10BCornwall)
[20:04:32] <brennen>	 this is what i like to see when i belatedly remember it's the backport window.
[20:07:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P46148 and previous config saved to /var/cache/conftool/dbconfig/20230406-200750-ladsgroup.json
[20:09:36] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-fe1013.eqiad.wmnet
[20:09:45] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-fe1014.eqiad.wmnet
[20:10:37] <icinga-wm_>	 PROBLEM - Check systemd state on ms-be1061 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:10:56] <wikibugs>	 (03CR) 10Ssingh: "To be merged on Monday." [puppet] - 10https://gerrit.wikimedia.org/r/906580 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh)
[20:15:20] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1013.eqiad.wmnet
[20:15:27] <icinga-wm_>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1061 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[20:15:53] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1104.eqiad.wmnet - https://phabricator.wikimedia.org/T329481 (10wiki_willy) a:03Jclark-ctr
[20:16:40] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1014.eqiad.wmnet
[20:17:43] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware: decommission frdata1001.frack.eqiad.wmnet (WMF7292) - https://phabricator.wikimedia.org/T333971 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr
[20:19:07] <icinga-wm_>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[20:19:37] <wikibugs>	 10SRE, 10ops-eqiad, 10serviceops-collab, 10GitLab (Infrastructure): Install additional SSDs on gitlab1004.wikimedia.org (B1) - https://phabricator.wikimedia.org/T333997 (10wiki_willy) a:03Jclark-ctr
[20:20:00] <wikibugs>	 10SRE, 10ops-eqiad, 10serviceops-collab, 10GitLab (Infrastructure): Install additional SSDs on gitlab1003.wikimedia.org (A3) - https://phabricator.wikimedia.org/T333996 (10wiki_willy) a:03Jclark-ctr
[20:20:55] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Engineering, 10Patch-For-Review: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr
[20:22:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T333332)', diff saved to https://phabricator.wikimedia.org/P46149 and previous config saved to /var/cache/conftool/dbconfig/20230406-202256-ladsgroup.json
[20:22:59] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2180.codfw.wmnet with reason: Maintenance
[20:23:01] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[20:23:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2180.codfw.wmnet with reason: Maintenance
[20:23:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T333332)', diff saved to https://phabricator.wikimedia.org/P46150 and previous config saved to /var/cache/conftool/dbconfig/20230406-202319-ladsgroup.json
[20:24:14] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10wiki_willy)
[20:24:46] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10wiki_willy)
[20:25:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T333332)', diff saved to https://phabricator.wikimedia.org/P46151 and previous config saved to /var/cache/conftool/dbconfig/20230406-202535-ladsgroup.json
[20:40:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P46152 and previous config saved to /var/cache/conftool/dbconfig/20230406-204041-ladsgroup.json
[20:41:31] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad
[20:41:57] <wikibugs>	 (03PS4) 10Subramanya Sastry: Make VE on officewiki use Parsoid directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896104 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler)
[20:42:23] <wikibugs>	 (03CR) 10Subramanya Sastry: "Rebased and resolved merge conflict." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896104 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler)
[20:43:44] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove info for new ssw as need to set back to planned to make homer happy - cmooney@cumin1001 - T322937"
[20:43:48] <stashbot>	 T322937: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937
[20:44:59] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove info for new ssw as need to set back to planned to make homer happy - cmooney@cumin1001 - T322937"
[20:45:45] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad
[20:48:38] <wikibugs>	 10SRE, 10DBA, 10DiscussionTools, 10Wikimedia-production-error: Large increase in insertThreadItems rate leading to db performance issues (was: Greater than average number of DBTransactionStateError/DBQueryErrors) - https://phabricator.wikimedia.org/T334023 (10matmarex)
[20:49:22] <logmsgbot>	 !log eevans@cumin1001 conftool action : set/weight=40; selector: name=ms-fe1013.eqiad.wmnet
[20:49:36] <logmsgbot>	 !log eevans@cumin1001 conftool action : set/weight=40; selector: name=ms-fe1014.eqiad.wmnet
[20:49:56] <logmsgbot>	 !log eevans@cumin1001 conftool action : set/pooled=yes; selector: name=ms-fe1013.eqiad.wmnet
[20:50:01] <logmsgbot>	 !log eevans@cumin1001 conftool action : set/pooled=yes; selector: name=ms-fe1014.eqiad.wmnet
[20:51:28] <wikibugs>	 (03CR) 10Subramanya Sastry: [C: 03+1] Make VE on officewiki use Parsoid directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/896104 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler)
[20:52:02] <wikibugs>	 10SRE-swift-storage: Bring ms-fe101[3-4] into service - https://phabricator.wikimedia.org/T334122 (10Eevans) 05Open→03Resolved Done!
[20:53:10] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED
[20:55:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P46153 and previous config saved to /var/cache/conftool/dbconfig/20230406-205548-ladsgroup.json
[20:57:58] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED
[20:59:00] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED
[21:00:28] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED
[21:00:43] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED
[21:02:31] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED
[21:02:47] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED
[21:04:55] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED
[21:05:20] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED
[21:07:23] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:07:35] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:10:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T333332)', diff saved to https://phabricator.wikimedia.org/P46154 and previous config saved to /var/cache/conftool/dbconfig/20230406-211054-ladsgroup.json
[21:10:59] <stashbot>	 T333332: Add af_actor/afh_actor fields to wmf wikis - https://phabricator.wikimedia.org/T333332
[21:15:27] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.708 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:15:35] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49851 bytes in 0.282 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:18:09] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED
[21:19:04] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudvirtlocal1001.mgmt.eqiad.wmnet with reboot policy FORCED
[21:22:40] <wikibugs>	 (03PS1) 10Mforns: analytics::refinery::job::druid_load: absent all jobs [puppet] - 10https://gerrit.wikimedia.org/r/906660 (https://phabricator.wikimedia.org/T334095)
[21:23:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] analytics::refinery::job::druid_load: absent all jobs [puppet] - 10https://gerrit.wikimedia.org/r/906660 (https://phabricator.wikimedia.org/T334095) (owner: 10Mforns)
[21:30:14] <wikibugs>	 (03PS1) 10Mforns: analytics::refinery::job::druid_load: Remove remaining jobs [puppet] - 10https://gerrit.wikimedia.org/r/906662 (https://phabricator.wikimedia.org/T334095)
[21:33:51] <wikibugs>	 (03Abandoned) 10Mforns: analytics::refinery::job::druid_load: absent all jobs [puppet] - 10https://gerrit.wikimedia.org/r/906660 (https://phabricator.wikimedia.org/T334095) (owner: 10Mforns)
[21:33:56] <wikibugs>	 (03Abandoned) 10Mforns: analytics::refinery::job::druid_load: Remove remaining jobs [puppet] - 10https://gerrit.wikimedia.org/r/906662 (https://phabricator.wikimedia.org/T334095) (owner: 10Mforns)
[21:37:02] <wikibugs>	 (03PS1) 10Mforns: ::analytics::refinery::job::druid_load: absent remaining jobs [puppet] - 10https://gerrit.wikimedia.org/r/906665 (https://phabricator.wikimedia.org/T334095)
[21:43:15] <wikibugs>	 (03PS1) 10Mforns: ::analytics::refinery::job::druid_load: Remove remaining jobs [puppet] - 10https://gerrit.wikimedia.org/r/906667
[21:43:16] <sbassett>	 Hey all - I’d like to fix a small issue in /private on production and deploy - let me know if I should hold off.
[21:43:45] <wikibugs>	 (03PS2) 10Mforns: ::analytics::refinery::job::druid_load: remove remaining jobs [puppet] - 10https://gerrit.wikimedia.org/r/906667
[21:46:10] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ::analytics::refinery::job::druid_load: remove remaining jobs [puppet] - 10https://gerrit.wikimedia.org/r/906667 (owner: 10Mforns)
[21:47:16] <wikibugs>	 (03PS3) 10Mforns: ::analytics::refinery::job::druid_load: remove remaining jobs [puppet] - 10https://gerrit.wikimedia.org/r/906667 (https://phabricator.wikimedia.org/T334095)
[21:52:53] <sbassett>	 !log Deployed updated mitigation for T333140
[21:52:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:22:34] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10Jhancock.wm) 05Open→03Resolved Received the new drive this afternoon.  Worked with Matthew to replace the drive. It seems to be working and no longer throwing errors. Going to...
[22:28:36] <wikibugs>	 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T333503 (10phaultfinder)
[22:32:45] <wikibugs>	 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T331373 (10Jhancock.wm) @Marostegui would you be able to help me with this swap? If so when would work best for you?
[23:27:14] <wikibugs>	 (03PS1) 10Kevin Bazira: httpbb: Add test cases for trwiki editquality inference services [puppet] - 10https://gerrit.wikimedia.org/r/906687 (https://phabricator.wikimedia.org/T334158)