[00:03:10] <tto>	 hi, just wanting to attract attention to T380729
[00:03:11] <stashbot>	 T380729: 2024-11-20 dump run appears stuck - https://phabricator.wikimedia.org/T380729
[00:03:12] <tto>	 in a few days it will be time for the 2024-12-01 dump to begin, so it would be good if the relevant people could investigate this sooner rather than later
[00:03:15] <tto>	 thanks!
[00:28:03] <greg-g>	 Reedy: hah! nice find
[00:38:17] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1098221
[00:38:17] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1098221 (owner: TrainBranchBot)
[01:01:07] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1098221 (owner: TrainBranchBot)
[01:08:25] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1098231
[01:08:25] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1098231 (owner: TrainBranchBot)
[01:30:07] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1098231 (owner: TrainBranchBot)
[01:39:23] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10360298 (Platonides) It does look good so far.   While it was losing messages at *:40-*:45, in the last two days it has only lost at 11:45-11...
[01:57:31] <jinxer-wm>	 FIRING: Not accepting/receiving prefixes from anycast BGP peer: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[02:09:26] <jinxer-wm>	 FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:36:47] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:04:11] <jinxer-wm>	 FIRING: [13x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:07:06] <jinxer-wm>	 FIRING: [13x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:00:28] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 626.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:52:28] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:26:36] <icinga-wm>	 PROBLEM - snapshot of x1 in codfw on backupmon1001 is CRITICAL: Last snapshot for x1 at codfw (db2197) taken on 2024-11-27 04:55:03 is 360 GiB, but the previous one was 531 GiB, a change of -32.2 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
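The -32.2 % figure in the backup alert above can be reproduced from the two snapshot sizes it quotes. A minimal Python sketch, assuming the check simply compares the latest size with the previous one; `percent_change` is a hypothetical helper for illustration, not the monitoring check's actual code:

```python
def percent_change(previous_gib, latest_gib):
    """Relative size change of the latest snapshot versus the previous one, in percent."""
    return (latest_gib - previous_gib) / previous_gib * 100


# Sizes quoted in the alert: previous snapshot 531 GiB, latest 360 GiB.
change = percent_change(531, 360)
print(f"{change:.1f} %")  # -32.2 %
```

A drop of roughly a third between consecutive snapshots is what the alert reports as CRITICAL; the threshold itself is not stated in the log.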
[05:50:02] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:57:31] <jinxer-wm>	 FIRING: Not accepting/receiving prefixes from anycast BGP peer: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[05:58:22] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - https://gerrit.wikimedia.org/r/1098076 (https://phabricator.wikimedia.org/T363484) (owner: C. Scott Ananian)
[06:02:35] <wikibugs>	 (PS2) C. Scott Ananian: Deploy Parsoid Read Views to de/ru wikivoyage and dagwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1093405 (https://phabricator.wikimedia.org/T375394)
[06:03:04] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - https://gerrit.wikimedia.org/r/1093405 (https://phabricator.wikimedia.org/T375394) (owner: C. Scott Ananian)
[06:09:26] <jinxer-wm>	 FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:11:48] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:18:02] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Idle - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:36:47] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T0700)
[07:00:53] <wikibugs>	 (PS1) Abijeet Patro: ext.uls.inputsettings: Use arrow functions [extensions/UniversalLanguageSelector] (wmf/1.44.0-wmf.4) - https://gerrit.wikimedia.org/r/1098413 (https://phabricator.wikimedia.org/T380431)
[07:00:56] <wikibugs>	 (PS1) Kevin Bazira: ml-services: deploy article-country to the article-models ns [deployment-charts] - https://gerrit.wikimedia.org/r/1098414 (https://phabricator.wikimedia.org/T371897)
[07:01:09] <wikibugs>	 (PS1) Abijeet Patro: Fix illegal access of typed property. [extensions/TranslationNotifications] (wmf/1.44.0-wmf.4) - https://gerrit.wikimedia.org/r/1098415 (https://phabricator.wikimedia.org/T380724)
[07:03:44] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/TranslationNotifications] (wmf/1.44.0-wmf.4) - https://gerrit.wikimedia.org/r/1098415 (https://phabricator.wikimedia.org/T380724) (owner: Abijeet Patro)
[07:04:11] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/UniversalLanguageSelector] (wmf/1.44.0-wmf.4) - https://gerrit.wikimedia.org/r/1098413 (https://phabricator.wikimedia.org/T380431) (owner: Abijeet Patro)
[07:07:06] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:17:56] <jinxer-wm>	 FIRING: ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:22:56] <jinxer-wm>	 RESOLVED: ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:23:36] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[07:24:19] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[07:28:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:33:56] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:34:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti7002.magru.wmnet with OS bookworm
[07:39:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:52:22] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/1098091 (owner: JMeybohm)
[07:57:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7002.magru.wmnet with reason: host reimage
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T0800). Please do the needful.
[08:00:05] <jouncebot>	 abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:01:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7002.magru.wmnet with reason: host reimage
[08:02:13] <kart_>	 abijeet: around?
[08:02:44] <abijeet>	 hello
[08:04:19] <kart_>	 abijeet: can these patches be tested separately, or do they depend on each other?
[08:04:21] <wikibugs>	 (PS1) Muehlenhoff: Add small helper script for checking firewall config for nftables and ferm [puppet] - https://gerrit.wikimedia.org/r/1098465
[08:04:36] <abijeet>	 kart_, they can be tested separately
[08:05:01] <kart_>	 I'll start with first patch and once we start deployment, will +2 on 2nd patch.
[08:05:08] <abijeet>	 ok
[08:05:54] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy2002 using scap backport" [extensions/TranslationNotifications] (wmf/1.44.0-wmf.4) - https://gerrit.wikimedia.org/r/1098415 (https://phabricator.wikimedia.org/T380724) (owner: Abijeet Patro)
[08:05:56] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:07:31] <icinga-wm>	 PROBLEM - Disk space on thanos-be1002 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sde1 174128 MB (4% inode=91%): /srv/swift-storage/sdc1 148931 MB (3% inode=90%): /srv/swift-storage/sdf1 178075 MB (4% inode=91%): /srv/swift-storage/sdd1 180759 MB (4% inode=91%): /srv/swift-storage/sdg1 188674 MB (4% inode=91%): /srv/swift-storage/sdh1 176265 MB (4% inode=92%): /srv/swift-storage/sdi1 208144 MB (5% inode=92%): /srv/swift-st
[08:07:31] <icinga-wm>	 j1 171786 MB (4% inode=92%): /srv/swift-storage/sdk1 170758 MB (4% inode=92%): /srv/swift-storage/sdm1 170428 MB (4% inode=92%): /srv/swift-storage/sdn1 180469 MB (4% inode=92%): /srv/swift-storage/sdl1 159027 MB (4% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1002&var-datasource=eqiad+prometheus/ops
[08:09:39] <icinga-wm>	 PROBLEM - Disk space on thanos-be1003 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdh1 171269 MB (4% inode=92%): /srv/swift-storage/sdc1 170580 MB (4% inode=91%): /srv/swift-storage/sdf1 201463 MB (5% inode=92%): /srv/swift-storage/sdg1 194113 MB (5% inode=92%): /srv/swift-storage/sdd1 171242 MB (4% inode=91%): /srv/swift-storage/sde1 185978 MB (4% inode=92%): /srv/swift-storage/sdi1 177540 MB (4% inode=91%): /srv/swift-st
[08:09:39] <icinga-wm>	 k1 175534 MB (4% inode=92%): /srv/swift-storage/sdj1 168947 MB (4% inode=91%): /srv/swift-storage/sdl1 170915 MB (4% inode=92%): /srv/swift-storage/sdm1 180730 MB (4% inode=91%): /srv/swift-storage/sdn1 146863 MB (3% inode=90%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops
[08:09:54] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add small helper script for checking firewall config for nftables and ferm [puppet] - https://gerrit.wikimedia.org/r/1098465 (owner: Muehlenhoff)
[08:14:19] <icinga-wm>	 PROBLEM - Disk space on thanos-be1001 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdf1 192288 MB (5% inode=92%): /srv/swift-storage/sdg1 182702 MB (4% inode=91%): /srv/swift-storage/sdc1 186648 MB (4% inode=92%): /srv/swift-storage/sdi1 163047 MB (4% inode=91%): /srv/swift-storage/sde1 158899 MB (4% inode=91%): /srv/swift-storage/sdh1 168798 MB (4% inode=91%): /srv/swift-storage/sdj1 205095 MB (5% inode=92%): /srv/swift-st
[08:14:19] <icinga-wm>	 k1 193241 MB (5% inode=92%): /srv/swift-storage/sdd1 150114 MB (3% inode=90%): /srv/swift-storage/sdm1 176126 MB (4% inode=92%): /srv/swift-storage/sdl1 163831 MB (4% inode=91%): /srv/swift-storage/sdn1 149088 MB (3% inode=90%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1001&var-datasource=eqiad+prometheus/ops
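The three thanos-be disk-space alerts above list one `mount free-MB (free% inode=N%)` entry per filesystem. A minimal Python sketch for pulling those entries out of an alert line and finding the tightest mount; the regular expression and the `parse_disk_alert` helper are assumptions made for illustration and are not part of Icinga or any Wikimedia tooling:

```python
import re

# One entry looks like: /srv/swift-storage/sde1 174128 MB (4% inode=91%)
ENTRY = re.compile(r"(/\S+) (\d+) MB \((\d+)% inode=(\d+)%\)")


def parse_disk_alert(message):
    """Return one dict per complete 'mount MB (pct% inode=pct%)' entry in the message."""
    return [
        {"mount": mount, "free_mb": int(mb), "free_pct": int(pct), "inode_pct": int(inode)}
        for mount, mb, pct, inode in ENTRY.findall(message)
    ]


# Two entries copied from the thanos-be1002 alert.
sample = (
    "DISK CRITICAL - free space: "
    "/srv/swift-storage/sde1 174128 MB (4% inode=91%): "
    "/srv/swift-storage/sdc1 148931 MB (3% inode=90%):"
)
entries = parse_disk_alert(sample)
lowest = min(entries, key=lambda e: e["free_mb"])
print(lowest["mount"], lowest["free_mb"])
```

Because the alerts are split across two IRC lines mid-path, an entry cut at the line boundary will not match the pattern and is skipped rather than guessed at.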
[08:15:56] <wikibugs>	 (Merged) jenkins-bot: Fix illegal access of typed property. [extensions/TranslationNotifications] (wmf/1.44.0-wmf.4) - https://gerrit.wikimedia.org/r/1098415 (https://phabricator.wikimedia.org/T380724) (owner: Abijeet Patro)
[08:17:03] <wikibugs>	 (PS1) Muehlenhoff: ganeti: Add missing file [puppet] - https://gerrit.wikimedia.org/r/1098467
[08:17:16] <logmsgbot>	 !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1098415|Fix illegal access of typed property. (T380724)]]
[08:17:21] <stashbot>	 T380724: Error: Typed property MediaWiki\Extension\TranslationNotifications\Jobs\GenericTranslationNotificationsJob::$logger must not be accessed before initialization - https://phabricator.wikimedia.org/T380724
[08:17:39] <wikibugs>	 (CR) CI reject: [V:-1] ganeti: Add missing file [puppet] - https://gerrit.wikimedia.org/r/1098467 (owner: Muehlenhoff)
[08:18:32] <wikibugs>	 (PS2) Muehlenhoff: ganeti: Add missing file [puppet] - https://gerrit.wikimedia.org/r/1098467
[08:19:26] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: Fabfur)
[08:22:28] <wikibugs>	 (Abandoned) David Caro: grid: disable hardcoded memory overcmommit on weblight [puppet] - https://gerrit.wikimedia.org/r/983139 (owner: David Caro)
[08:22:39] <wikibugs>	 (CR) Vgutierrez: [C:+1] benthos: add benthos for haproxy debug functions (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: Fabfur)
[08:23:25] <wikibugs>	 (CR) Muehlenhoff: [C:+2] ganeti: Add missing file [puppet] - https://gerrit.wikimedia.org/r/1098467 (owner: Muehlenhoff)
[08:23:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[08:23:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002"
[08:23:51] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) (owner: Fabfur)
[08:24:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002"
[08:24:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7002.magru.wmnet with OS bookworm
[08:24:49] <wikibugs>	 (CR) Vgutierrez: [C:-1] hiera: add log ring to cp4039 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) (owner: Fabfur)
[08:24:50] <logmsgbot>	 !log kartik@deploy2002 kartik, abi: Backport for [[gerrit:1098415|Fix illegal access of typed property. (T380724)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:24:54] <stashbot>	 T380724: Error: Typed property MediaWiki\Extension\TranslationNotifications\Jobs\GenericTranslationNotificationsJob::$logger must not be accessed before initialization - https://phabricator.wikimedia.org/T380724
[08:25:04] <kart_>	 abijeet: Please test.
[08:25:22] <abijeet>	 kart_, ok
[08:27:09] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:29:17] <kart_>	 abijeet: should we +2 2nd patch?
[08:30:52] <abijeet>	 kart_, looks good, we can proceed
[08:31:08] <abijeet>	 and yea, we can +2 2nd patch
[08:31:34] <kart_>	 nice!
[08:31:39] <logmsgbot>	 !log kartik@deploy2002 kartik, abi: Continuing with sync
[08:32:13] <wikibugs>	 (CR) KartikMistry: [C:+2] ext.uls.inputsettings: Use arrow functions [extensions/UniversalLanguageSelector] (wmf/1.44.0-wmf.4) - https://gerrit.wikimedia.org/r/1098413 (https://phabricator.wikimedia.org/T380431) (owner: Abijeet Patro)
[08:35:40] <wikibugs>	 (CR) Fabfur: hiera: add log ring to cp4039 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) (owner: Fabfur)
[08:36:09] <wikibugs>	 (PS5) Fabfur: hiera: add log ring to cp4039 [puppet] - https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332)
[08:37:39] <wikibugs>	 (PS1) Muehlenhoff: sre.ganeti.addnode: Readd firewall check [cookbooks] - https://gerrit.wikimedia.org/r/1098468
[08:38:19] <logmsgbot>	 !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098415|Fix illegal access of typed property. (T380724)]] (duration: 21m 02s)
[08:38:23] <stashbot>	 T380724: Error: Typed property MediaWiki\Extension\TranslationNotifications\Jobs\GenericTranslationNotificationsJob::$logger must not be accessed before initialization - https://phabricator.wikimedia.org/T380724
[08:39:11] <kart_>	 abijeet: going with 2nd patch..
[08:39:41] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy2002 using scap backport" [extensions/UniversalLanguageSelector] (wmf/1.44.0-wmf.4) - https://gerrit.wikimedia.org/r/1098413 (https://phabricator.wikimedia.org/T380431) (owner: Abijeet Patro)
[08:42:18] <abijeet>	 kart_, ok
[08:43:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[08:45:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm
[08:49:08] <wikibugs>	 (PS8) Fabfur: benthos: add benthos for haproxy debug functions [puppet] - https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332)
[08:49:28] <wikibugs>	 (CR) Fabfur: benthos: add benthos for haproxy debug functions (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: Fabfur)
[08:49:35] <wikibugs>	 (Merged) jenkins-bot: ext.uls.inputsettings: Use arrow functions [extensions/UniversalLanguageSelector] (wmf/1.44.0-wmf.4) - https://gerrit.wikimedia.org/r/1098413 (https://phabricator.wikimedia.org/T380431) (owner: Abijeet Patro)
[08:49:38] <wikibugs>	 (PS6) Fabfur: hiera: add log ring to cp4039 [puppet] - https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332)
[08:50:02] <logmsgbot>	 !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1098413|ext.uls.inputsettings: Use arrow functions (T380431)]]
[08:50:07] <stashbot>	 T380431: TypeError: this.markDirty is not a function 	 - https://phabricator.wikimedia.org/T380431
[08:51:27] <wikibugs>	 SRE, Infrastructure-Foundations, serviceops: Clean up the Docker Registry catalog and Swift storage from old images - https://phabricator.wikimedia.org/T375645#10360573 (elukey) Open→Declined The K8s SIG reviewed this proposal and for the moment it was decided not to proceed with anything...
[08:55:42] <logmsgbot>	 !log kartik@deploy2002 abi, kartik: Backport for [[gerrit:1098413|ext.uls.inputsettings: Use arrow functions (T380431)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:55:47] <stashbot>	 T380431: TypeError: this.markDirty is not a function 	 - https://phabricator.wikimedia.org/T380431
[08:56:41] <kart_>	 abijeet: you can test the patch on mwdebug
[08:56:50] <abijeet>	 kart_, ok
[08:57:39] <wikibugs>	 (PS1) Muehlenhoff: Readd Ganeti role to ganeti7001/7002 [puppet] - https://gerrit.wikimedia.org/r/1098469 (https://phabricator.wikimedia.org/T376737)
[08:57:56] <abijeet>	 kart_, works fine
[08:59:06] <kart_>	 cool. going ahead!
[08:59:12] <logmsgbot>	 !log kartik@deploy2002 abi, kartik: Continuing with sync
[08:59:15] <wikibugs>	 SRE, Incident-Reporting-System (Pilot wiki release December 2024), Trust and Safety Product Sprint (Sprint Gong (November 18 - December 6)): Allow Extension:ReportIncident to make POST requests to wikimediats.zendesk.com - https://phabricator.wikimedia.org/T380908#10360585 (kostajh) AIUI the mechansi...
[08:59:45] <wikibugs>	 Puppet, SRE-swift-storage, SRE-tools, DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10360586 (elukey) I tried to download and install perccli == `007.2616.0000.0000` on ms-be2081 but no luck, same...
[09:00:05] <jouncebot>	 hashar and andre: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T0900)
[09:02:18] <wikibugs>	 (PS1) Slyngshede: Show CN as signed in username [software/bitu] - https://gerrit.wikimedia.org/r/1098470 (https://phabricator.wikimedia.org/T378344)
[09:03:16] <kart_>	 hashar, andre: I'm finishing the deployment, it will take a few minutes.
[09:03:39] <hashar>	 o/
[09:03:42] <hashar>	 yeah take your time
[09:03:52] <hashar>	 I am going to brew a coffee and watch the error log
[09:04:15] <wikibugs>	 (CR) Vgutierrez: benthos: add benthos for haproxy debug functions (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: Fabfur)
[09:04:22] <wikibugs>	 (CR) DCausse: [C:+1] wdqs-ldf: Make Data Platform SRE the recipient of the LDF alerts [puppet] - https://gerrit.wikimedia.org/r/1097441 (https://phabricator.wikimedia.org/T379182) (owner: Bking)
[09:05:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage
[09:06:09] <logmsgbot>	 !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098413|ext.uls.inputsettings: Use arrow functions (T380431)]] (duration: 16m 06s)
[09:06:21] <stashbot>	 T380431: TypeError: this.markDirty is not a function 	 - https://phabricator.wikimedia.org/T380431
[09:06:48] <kart_>	 hashar: done. Have a nice day & :coffee
[09:07:07] <kart_>	 abijeet: we are done!
[09:07:21] <hashar>	 that was fast
[09:07:25] <hashar>	 (kind of)
[09:08:50] <wikibugs>	 Puppet, SRE-swift-storage, SRE-tools, DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10360601 (elukey) I think we could easily try to swap perccli with storcli for the hosts with SAS3908 onboard, b...
[09:09:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage
[09:10:46] <abijeet>	 kart_, thanks
[09:13:55] <wikibugs>	 (PS9) Fabfur: benthos: add benthos for haproxy debug functions [puppet] - https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332)
[09:14:12] <wikibugs>	 (CR) Fabfur: benthos: add benthos for haproxy debug functions (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: Fabfur)
[09:14:20] <wikibugs>	 (PS7) Fabfur: hiera: add log ring to cp4039 [puppet] - https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332)
[09:19:07] <hashar>	 logs look good, I am processing
[09:19:17] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: Fabfur)
[09:19:44] <wikibugs>	 (PS1) TrainBranchBot: group1 to 1.44.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/1098471 (https://phabricator.wikimedia.org/T375664)
[09:19:45] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group1 to 1.44.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/1098471 (https://phabricator.wikimedia.org/T375664) (owner: TrainBranchBot)
[09:20:31] <wikibugs>	 (Merged) jenkins-bot: group1 to 1.44.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/1098471 (https://phabricator.wikimedia.org/T375664) (owner: TrainBranchBot)
[09:22:52] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10360605 (elukey) Resolved→Open @Jhancock.wm hi! We have done a lot of weird tests with these nodes, I think that we should re-run provision for...
[09:26:53] <hashar>	 ok so httpbb failed again
[09:29:44] * hashar files a task
[09:29:57] <wikibugs>	 Puppet, SRE-swift-storage, SRE-tools, DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10360614 (MoritzMuehlenhoff) >>! In T377853#10360612, @MoritzMuehlenhoff wrote: > There are debs available in th...
[09:30:13] <wikibugs>	 Puppet, SRE-swift-storage, SRE-tools, DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10360612 (MoritzMuehlenhoff) There are debs available in the Thomas Krenn repo (German server vendor): https://w...
[09:30:29] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10360610 (elukey) @VRiley-WMF @Jclark-ctr Hi! We are ready to start provisioning these nodes, but the procedure is a little bit more convoluted than th...
[09:30:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002"
[09:31:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002"
[09:31:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm
[09:36:14] <wikibugs>	 Puppet, SRE-swift-storage, SRE-tools, DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10360637 (MoritzMuehlenhoff) One other option is to try https://github.com/namiltd/megactl with this controller....
[09:39:35] <wikibugs>	 SRE, Deployments, serviceops-radar: Confusing failed httpbb check for totoro.wikimedia.org during scap deployment - https://phabricator.wikimedia.org/T364880#10360638 (hashar) Open→Resolved a:RLazarus I am marking this one resolved since the confusing https://totoro.wikimedia.org. URL...
[09:44:36] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Readd Ganeti role to ganeti7001/7002 [puppet] - https://gerrit.wikimedia.org/r/1098469 (https://phabricator.wikimedia.org/T376737) (owner: Muehlenhoff)
[09:44:59] <logmsgbot>	 !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.5  refs T375664
[09:45:10] <stashbot>	 T375664: 1.44.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T375664
[09:45:38] <elukey>	 -15
[09:45:45] <elukey>	 nope :D
[09:46:41] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.provision for host cp7006.mgmt.magru.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[09:48:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "readded ganeti nodes in magru - jmm@cumin2002 - T376737"
[09:48:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "readded ganeti nodes in magru - jmm@cumin2002 - T376737"
[09:49:56] <icinga-wm>	 PROBLEM - Host cp7006 is DOWN: PING CRITICAL - Packet loss = 100%
[09:55:08] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp7006.mgmt.magru.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[09:56:00] <icinga-wm>	 RECOVERY - Host cp7006 is UP: PING OK - Packet loss = 0%, RTA = 114.95 ms
[09:57:46] <jinxer-wm>	 FIRING: Not accepting/receiving prefixes from anycast BGP peer: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[09:58:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti7001
[09:59:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti7001
[09:59:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti7002
[10:00:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti7002
[10:01:16] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.provision for host cp7008.mgmt.magru.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[10:04:30] <icinga-wm>	 PROBLEM - Host cp7008 is DOWN: PING CRITICAL - Packet loss = 100%
[10:08:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet
[10:09:26] <jinxer-wm>	 FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:10:19] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp7008.mgmt.magru.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[10:10:40] <icinga-wm>	 RECOVERY - Host cp7008 is UP: PING OK - Packet loss = 0%, RTA = 115.07 ms
[10:14:49] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] "LGTM1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098414 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira)
[10:15:18] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur)
[10:18:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet
[10:19:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet
[10:21:22] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] sre.ganeti.addnode: Readd firewall check [cookbooks] - 10https://gerrit.wikimedia.org/r/1098468 (owner: 10Muehlenhoff)
[10:28:30] <wikibugs>	 (03CR) 10Gehel: [C:03+1] wdqs-ldf: Make Data Platform SRE the recipient of the LDF alerts [puppet] - 10https://gerrit.wikimedia.org/r/1097441 (https://phabricator.wikimedia.org/T379182) (owner: 10Bking)
[10:29:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet
[10:35:54] <wikibugs>	 (03PS1) 10Klausman: ml-lab: Allow users to run nvtop via sudo [puppet] - 10https://gerrit.wikimedia.org/r/1098478
[10:36:42] <icinga-wm>	 PROBLEM - snapshot of x1 in eqiad on backupmon1001 is CRITICAL: Last snapshot for x1 at eqiad (db1216) taken on 2024-11-27 10:07:57 is 325 GiB, but the previous one was 469 GiB, a change of -30.6 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[10:36:47] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[10:36:49] <wikibugs>	 (03PS2) 10Klausman: ml-lab: Allow users to run nvtop and radeontop via sudo [puppet] - 10https://gerrit.wikimedia.org/r/1098478
[10:39:04] <wikibugs>	 (03CR) 10Kevin Bazira: [C:03+2] ml-services: deploy article-country to the article-models ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098414 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira)
[10:39:06] <jinxer-wm>	 FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[10:40:29] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: deploy article-country to the article-models ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098414 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira)
[10:44:06] <jinxer-wm>	 RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[10:44:40] <wikibugs>	 (03PS1) 10Arnaudb: mariadb: add innodb buffer pool usage monitoring [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589)
[10:44:40] <wikibugs>	 (03CR) 10Arnaudb: "x and m sections are excluded from this alert" [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) (owner: 10Arnaudb)
[10:46:44] <wikibugs>	 (03PS8) 10Fabfur: hiera: add log ring to cp4039 [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332)
[10:54:48] <wikibugs>	 (03PS1) 10Máté Szabó: Add HTTP proxy for IRS Zendesk integration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098480 (https://phabricator.wikimedia.org/T380908)
[10:56:21] <wikibugs>	 (03CR) 10Kosta Harlan: Add HTTP proxy for IRS Zendesk integration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098480 (https://phabricator.wikimedia.org/T380908) (owner: 10Máté Szabó)
[10:57:07] <wikibugs>	 06SRE, 10Incident-Reporting-System (Pilot wiki release December 2024), 13Patch-For-Review, 10Trust and Safety Product Sprint (Sprint Gong (November 18 - December 6)): Allow Extension:ReportIncident to make POST requests to wikimediats.zendesk.com - https://phabricator.wikimedia.org/T380908#10360873 (10mszab...
[11:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T1100)
[11:02:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru01 and group B3
[11:03:46] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru01 and group B3
[11:04:29] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' .
[11:05:36] <wikibugs>	 (03PS2) 10Máté Szabó: Configure IRS Zendesk integration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098480 (https://phabricator.wikimedia.org/T380908)
[11:05:40] <wikibugs>	 (03CR) 10Máté Szabó: Configure IRS Zendesk integration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098480 (https://phabricator.wikimedia.org/T380908) (owner: 10Máté Szabó)
[11:07:06] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:13:09] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on lvs7001.magru.wmnet with reason: T376737
[11:13:23] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs7001.magru.wmnet with reason: T376737
[11:13:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti7002.magru.wmnet to cluster magru02 and group B4
[11:14:51] <wikibugs>	 (03CR) 10Kosta Harlan: Configure IRS Zendesk integration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098480 (https://phabricator.wikimedia.org/T380908) (owner: 10Máté Szabó)
[11:15:15] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7002.magru.wmnet to cluster magru02 and group B4
[11:16:02] <xSavitar>	 !log T380875 Ran mwscript-k8s --comment="T380875" -f -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=metawiki --logwiki=metawiki 'EMBakeryEquipment' 'Janapanna'
[11:16:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:10] <stashbot>	 T380875: Unblock stuck global rename of Janapanna - https://phabricator.wikimedia.org/T380875
[11:16:24] <icinga-wm>	 PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:16:57] <jinxer-wm>	 FIRING: [10x] ProbeDown: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:17:12] <vgutierrez>	 !incidents
[11:17:12] <sirenbot>	 5480 (ACKED)  PyBalBGPUnstable lvs sre (lvs7003:9090 pybal 64600 10.140.0.1 magru)
[11:17:13] <sirenbot>	 5482 (UNACKED)  [10x] ProbeDown sre (probes/service magru)
[11:17:13] <sirenbot>	 5478 (RESOLVED)  [10x] ProbeDown sre (probes/service magru)
[11:17:13] <sirenbot>	 5477 (RESOLVED)  [10x] ProbeDown sre (probes/service magru)
[11:17:13] <sirenbot>	 5475 (RESOLVED)  [7x] ProbeDown sre (probes/service magru)
[11:17:14] <claime>	 hello to you to ncredit
[11:17:19] <claime>	 too*
[11:17:21] <vgutierrez>	 !ack 5482
[11:17:21] <sirenbot>	 5482 (ACKED)  [10x] ProbeDown sre (probes/service magru)
[11:17:31] <hnowlan>	 ty :) 
[11:17:49] <fabfur>	 downtime it again
[11:17:50] <vgutierrez>	 fabfur: could we get magru alerts silenced accordingly please?
[11:17:52] <vgutierrez>	 thx
[11:18:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7002.magru.wmnet to drbd
[11:19:15] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on 16 hosts with reason: T376737
[11:19:30] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 16 hosts with reason: T376737
[11:19:43] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on lvs[7001-7003].magru.wmnet with reason: T376737
[11:19:59] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on lvs[7001-7003].magru.wmnet with reason: T376737
[11:20:24] <icinga-wm>	 RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:20:56] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on dns7001.wikimedia.org with reason: T376737
[11:21:10] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dns7001.wikimedia.org with reason: T376737
[11:21:26] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on dns7002.wikimedia.org with reason: T376737
[11:21:39] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dns7002.wikimedia.org with reason: T376737
[11:21:57] <jinxer-wm>	 RESOLVED: [10x] ProbeDown: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:22:06] <jinxer-wm>	 FIRING: [22x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:26:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job benthos in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:27:32] <icinga-wm>	 PROBLEM - Disk space on thanos-be1002 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sde1 175438 MB (4% inode=91%): /srv/swift-storage/sdc1 152201 MB (3% inode=90%): /srv/swift-storage/sdf1 179419 MB (4% inode=91%): /srv/swift-storage/sdd1 181544 MB (4% inode=91%): /srv/swift-storage/sdg1 184842 MB (4% inode=91%): /srv/swift-storage/sdh1 175138 MB (4% inode=92%): /srv/swift-storage/sdi1 203095 MB (5% inode=92%): /srv/swift-st
[11:27:32] <icinga-wm>	 j1 173760 MB (4% inode=92%): /srv/swift-storage/sdk1 167809 MB (4% inode=92%): /srv/swift-storage/sdm1 169579 MB (4% inode=92%): /srv/swift-storage/sdn1 185368 MB (4% inode=92%): /srv/swift-storage/sdl1 157703 MB (4% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1002&var-datasource=eqiad+prometheus/ops
[11:28:17] <wikibugs>	 (03PS1) 10Ladsgroup: Bump ratio of new parsercache key spec to 3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098484 (https://phabricator.wikimedia.org/T373037)
[11:29:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7002.magru.wmnet to drbd
[11:29:34] <icinga-wm>	 PROBLEM - Host ncredir7002 is DOWN: PING CRITICAL - Packet loss = 100%
[11:29:38] <icinga-wm>	 RECOVERY - Host ncredir7002 is UP: PING OK - Packet loss = 0%, RTA = 115.46 ms
[11:30:01] <Amir1>	 jouncebot: nowandnext
[11:30:02] <jouncebot>	 For the next 0 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T1100)
[11:30:02] <jouncebot>	 In 0 hour(s) and 29 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T1200)
[11:31:06] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098484 (https://phabricator.wikimedia.org/T373037) (owner: 10Ladsgroup)
[11:31:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job benthos in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:31:52] <wikibugs>	 (03Merged) 10jenkins-bot: Bump ratio of new parsercache key spec to 3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098484 (https://phabricator.wikimedia.org/T373037) (owner: 10Ladsgroup)
[11:32:20] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1098484|Bump ratio of new parsercache key spec to 3 (T373037)]]
[11:32:26] <stashbot>	 T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037
[11:34:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of bast7001.wikimedia.org to drbd
[11:34:40] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] "Just a few tips for improved readability." [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) (owner: 10Arnaudb)
[11:38:11] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1098484|Bump ratio of new parsercache key spec to 3 (T373037)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[11:38:15] <stashbot>	 T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037
[11:38:32] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[11:39:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[11:39:39] <wikibugs>	 (03PS1) 10AOkoth: miscweb: support os-reports deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794)
[11:41:03] <wikibugs>	 (03PS3) 10Máté Szabó: Configure IRS Zendesk integration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098480 (https://phabricator.wikimedia.org/T380908)
[11:41:09] <wikibugs>	 (03CR) 10Máté Szabó: Configure IRS Zendesk integration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098480 (https://phabricator.wikimedia.org/T380908) (owner: 10Máté Szabó)
[11:43:07] <wikibugs>	 (03CR) 10Kosta Harlan: Configure IRS Zendesk integration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098480 (https://phabricator.wikimedia.org/T380908) (owner: 10Máté Szabó)
[11:45:11] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098484|Bump ratio of new parsercache key spec to 3 (T373037)]] (duration: 12m 51s)
[11:45:16] <stashbot>	 T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037
[11:49:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.62s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:49:19] <wikibugs>	 (03PS1) 10Muehlenhoff: miscweb: support os-reports deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[11:50:22] <wikibugs>	 (03PS4) 10Máté Szabó: Configure IRS Zendesk integration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098480 (https://phabricator.wikimedia.org/T380908)
[11:50:29] <wikibugs>	 (03CR) 10Máté Szabó: Configure IRS Zendesk integration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098480 (https://phabricator.wikimedia.org/T380908) (owner: 10Máté Szabó)
[11:53:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of bast7001.wikimedia.org to drbd
[11:54:14] <icinga-wm>	 PROBLEM - Host ganeti2042 is DOWN: PING CRITICAL - Packet loss = 100%
[11:54:14] <icinga-wm>	 PROBLEM - Host bast7001 is DOWN: PING CRITICAL - Packet loss = 100%
[11:54:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.62s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:54:28] <icinga-wm>	 RECOVERY - Host bast7001 is UP: PING OK - Packet loss = 0%, RTA = 115.64 ms
[11:54:43] <wikibugs>	 (03PS1) 10KartikMistry: Update cxserver to 2024-11-20-121713-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098488 (https://phabricator.wikimedia.org/T377966)
[11:54:47] <vgutierrez>	 ganeti2042 is expected?
[11:57:18] <wikibugs>	 (03PS1) 10AOkoth: mailman: run tasks every 24 hours [puppet] - 10https://gerrit.wikimedia.org/r/1098489 (https://phabricator.wikimedia.org/T377045)
[11:58:31] <wikibugs>	 (03PS1) 10Mvolz: Update zotero package_lock and translators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098490 (https://phabricator.wikimedia.org/T378460)
[11:58:42] <icinga-wm>	 RECOVERY - Host ganeti2042 is UP: PING OK - Packet loss = 0%, RTA = 30.35 ms
[11:59:41] <kart_>	 OK to deploy cxserver?
[11:59:43] <wikibugs>	 (03PS1) 10Cathal Mooney: Change IP for lvs7003 on public1-b3-magru to 195.200.68.5/27 [puppet] - 10https://gerrit.wikimedia.org/r/1098491 (https://phabricator.wikimedia.org/T376737)
[11:59:54] <vgutierrez>	 moritzm: did ganeti2042 just crash?
[12:00:05] <jouncebot>	 mvolz: gettimeofday() says it's time for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T1200)
[12:00:10] <wikibugs>	 (03PS2) 10Arnaudb: mariadb: add innodb buffer pool usage monitoring [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589)
[12:01:00] <wikibugs>	 (03CR) 10Arnaudb: mariadb: add innodb buffer pool usage monitoring (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) (owner: 10Arnaudb)
[12:01:11] <moritzm>	 vgutierrez: expired downtime
[12:01:17] <moritzm>	 I'll extend it
[12:01:23] <vgutierrez>	 oh ok
[12:01:41] <moritzm>	 the server is freshly procured from Supermicro, but has a broken CPU and DC ops are figuring out the process to get the part replaced
[12:01:49] <wikibugs>	 (03PS3) 10Arnaudb: mariadb: add innodb buffer pool usage monitoring [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589)
[12:01:56] <claime>	 kart_: yep
[12:03:02] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mariadb: add innodb buffer pool usage monitoring [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) (owner: 10Arnaudb)
[12:03:21] <wikibugs>	 (03CR) 10Mvolz: [C:03+2] Update zotero package_lock and translators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098490 (https://phabricator.wikimedia.org/T378460) (owner: 10Mvolz)
[12:03:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on ganeti2042.codfw.wmnet with reason: broken CPU
[12:03:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on ganeti2042.codfw.wmnet with reason: broken CPU
[12:04:23] <wikibugs>	 (03Merged) 10jenkins-bot: Update zotero package_lock and translators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098490 (https://phabricator.wikimedia.org/T378460) (owner: 10Mvolz)
[12:04:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Patch looks good to me. The established process for changing sudo permissions to have them discussed in the weekly SRE IF meeting, I've ad" [puppet] - 10https://gerrit.wikimedia.org/r/1098478 (owner: 10Klausman)
[12:05:00] <kart_>	 claime: Thanks
[12:05:07] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[12:05:10] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[12:05:24] <logmsgbot>	 !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/zotero: apply
[12:05:27] <kart_>	 ah. Forgot to merge the patch ;)
[12:05:45] <logmsgbot>	 !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/zotero: apply
[12:06:12] <logmsgbot>	 !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/zotero: apply
[12:06:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of prometheus7001.magru.wmnet to drbd
[12:06:24] <wikibugs>	 (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2024-11-20-121713-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098488 (https://phabricator.wikimedia.org/T377966) (owner: 10KartikMistry)
[12:06:44] <logmsgbot>	 !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/zotero: apply
[12:07:02] <wikibugs>	 (03PS17) 10Hnowlan: mediawiki: add mercurius features [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701)
[12:07:33] <wikibugs>	 (03Merged) 10jenkins-bot: Update cxserver to 2024-11-20-121713-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098488 (https://phabricator.wikimedia.org/T377966) (owner: 10KartikMistry)
[12:07:42] <logmsgbot>	 !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/zotero: apply
[12:07:46] <wikibugs>	 (03CR) 10Hnowlan: mediawiki: add mercurius features (038 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan)
[12:08:20] <logmsgbot>	 !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply
[12:09:41] <wikibugs>	 (03PS4) 10Arnaudb: mariadb: add innodb buffer pool usage monitoring [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589)
[12:11:02] <wikibugs>	 (03CR) 10Arnaudb: "I've kept the annotations on the critical threshold, they have different summaries (90% vs 99%). Please lmk if it's not ok!" [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) (owner: 10Arnaudb)
[12:11:50] <wikibugs>	 (03PS5) 10Arnaudb: mariadb: add innodb buffer pool usage monitoring [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589)
[12:12:19] <moritzm>	 !log installing openssl security updates
[12:12:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:53] <wikibugs>	 (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098111 (owner: 10PipelineBot)
[12:13:39] <wikibugs>	 (03PS1) 10KartikMistry: Update recommendation-api to 2024-11-27-065850-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098495 (https://phabricator.wikimedia.org/T380838)
[12:13:45] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[12:13:56] <wikibugs>	 (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098111 (owner: 10PipelineBot)
[12:13:56] <wikibugs>	 (03CR) 10Effie Mouzeli: "Since we will be using the mesh, shall we enable it in the fixtures?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert)
[12:14:09] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[12:16:03] <wikibugs>	 (03CR) 10Clément Goubert: "I did, then forgot to upload the PS. Incoming." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert)
[12:18:46] <moritzm>	 !log installing python-cryptography security updates
[12:18:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:07] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[12:20:35] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[12:22:19] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[12:22:53] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[12:24:17] <kart_>	 !log Updated cxserver to 2024-11-20-121713-production (T377966, T357950)
[12:24:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:22] <stashbot>	 T377966: cxserver: Logstash entries seems difficult to read - https://phabricator.wikimedia.org/T377966
[12:24:22] <stashbot>	 T357950: Remove  servicerunner dependency for cxserver - https://phabricator.wikimedia.org/T357950
[12:24:27] <logmsgbot>	 !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply
[12:24:51] <logmsgbot>	 !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply
[12:26:06] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "nice and thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1098491 (https://phabricator.wikimedia.org/T376737) (owner: 10Cathal Mooney)
[12:26:24] <wikibugs>	 (03CR) 10Marostegui: "This needs more thinking, especially the alerting. Please do not merge yet." [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) (owner: 10Arnaudb)
[12:26:38] <icinga-wm>	 PROBLEM - Host prometheus7001 is DOWN: PING CRITICAL - Packet loss = 100%
[12:29:06] <wikibugs>	 (03PS7) 10Clément Goubert: mediawiki: Add mwcron feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555)
[12:31:21] <wikibugs>	 (03PS3) 10NMW03: Updated wordmark for Azerbaijani Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (https://phabricator.wikimedia.org/T380974)
[12:31:47] <wikibugs>	 (03PS6) 10Arnaudb: mariadb: add innodb buffer pool usage monitoring [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589)
[12:32:01] <wikibugs>	 (03CR) 10Arnaudb: "- I've established a more descriptive baseline" [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) (owner: 10Arnaudb)
[12:32:30] <wikibugs>	 (03CR) 10Anzx: [C:03+1] Updated wordmark for Azerbaijani Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (https://phabricator.wikimedia.org/T380974) (owner: 10NMW03)
[12:32:52] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (https://phabricator.wikimedia.org/T380974) (owner: 10NMW03)
[12:34:18] <wikibugs>	 (03CR) 10Anzx: [C:04-1] "there are some unrelated changes to azwikiquote, you need to fix it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (https://phabricator.wikimedia.org/T380974) (owner: 10NMW03)
[12:34:20] <wikibugs>	 (03CR) 10Arnaudb: "* I've established a more descriptive baseline → misphrasing went through:" [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) (owner: 10Arnaudb)
[12:35:38] <wikibugs>	 (03CR) 10Marostegui: "We should make this a conditional and only start sending warnings (I don't think we need a critical) if the Uptime is higher than XX (to b" [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) (owner: 10Arnaudb)
[12:35:53] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Change IP for lvs7003 on public1-b3-magru to 195.200.68.5/27 [puppet] - 10https://gerrit.wikimedia.org/r/1098491 (https://phabricator.wikimedia.org/T376737) (owner: 10Cathal Mooney)
[12:36:05] <wikibugs>	 (03CR) 10KartikMistry: [C:03+2] Update recommendation-api to 2024-11-27-065850-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098495 (https://phabricator.wikimedia.org/T380838) (owner: 10KartikMistry)
[12:36:38] <wikibugs>	 (03CR) 10Arnaudb: "good idea, I'll go in that direction, transitioning to WIP in the meantime" [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) (owner: 10Arnaudb)
[12:37:08] <wikibugs>	 (03Merged) 10jenkins-bot: Update recommendation-api to 2024-11-27-065850-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098495 (https://phabricator.wikimedia.org/T380838) (owner: 10KartikMistry)
[12:38:13] <effie>	 !log start replacing kafka-main1002 with kafka-main1007 - T363214
[12:38:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:17] <stashbot>	 T363214: kafka-main100[6789] and kafka-main1010 implementation tracking - https://phabricator.wikimedia.org/T363214
[12:39:11] <logmsgbot>	 !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[12:47:34] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: neutron-openvswitch-agent: prevent puppet from restarting the service [puppet] - 10https://gerrit.wikimedia.org/r/1098498 (https://phabricator.wikimedia.org/T380972)
[12:48:13] <wikibugs>	 (03CR) 10CI reject: [V:04-1] openstack: neutron-openvswitch-agent: prevent puppet from restarting the service [puppet] - 10https://gerrit.wikimedia.org/r/1098498 (https://phabricator.wikimedia.org/T380972) (owner: 10Arturo Borrero Gonzalez)
[12:48:41] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: openstack: neutron-openvswitch-agent: prevent puppet from restarting the service [puppet] - 10https://gerrit.wikimedia.org/r/1098498 (https://phabricator.wikimedia.org/T380972)
[12:49:44] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1098498 (https://phabricator.wikimedia.org/T380972) (owner: 10Arturo Borrero Gonzalez)
[12:49:49] <wikibugs>	 (03PS1) 10Hnowlan: jobqueue: disable webVideoTranscodePrioritized [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098499 (https://phabricator.wikimedia.org/T371701)
[12:50:21] <moritzm>	 !log installing ghostscript security updates
[12:50:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:31] <wikibugs>	 (03PS18) 10Hnowlan: mediawiki: add mercurius features [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701)
[12:56:32] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kafka-main[1002,1007].eqiad.wmnet with reason: Hardware refresh
[12:56:36] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kafka-main[1002,1007].eqiad.wmnet with reason: Hardware refresh
[12:57:31] <jinxer-wm>	 RESOLVED: Not accepting/receiving prefixes from anycast BGP peer: Device asw1-b4-magru.mgmt.magru.wmnet recovered from Not accepting/receiving prefixes from anycast BGP peer   - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer
[13:01:56] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1098498 (https://phabricator.wikimedia.org/T380972) (owner: 10Arturo Borrero Gonzalez)
[13:03:58] <kostajh>	 jouncebot: nowandnext
[13:03:58] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 56 minute(s)
[13:03:58] <jouncebot>	 In 0 hour(s) and 56 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T1400)
[13:04:18] <kostajh>	 mszabo and I will deploy some operations/mediawiki-config changes, unless anyone objects
[13:05:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of prometheus7001.magru.wmnet to drbd
[13:05:24] <icinga-wm>	 RECOVERY - Host prometheus7001 is UP: PING OK - Packet loss = 0%, RTA = 115.41 ms
[13:05:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh7002.wikimedia.org to drbd
[13:06:14] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: neutron-openvswitch-agent: prevent puppet from restarting the service [puppet] - 10https://gerrit.wikimedia.org/r/1098498 (https://phabricator.wikimedia.org/T380972) (owner: 10Arturo Borrero Gonzalez)
[13:08:22] <icinga-wm>	 PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:08:50] <icinga-wm>	 PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:09:45] <wikibugs>	 (03PS1) 10KartikMistry: Fix LANGUAGE_PAIRS_API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098504
[13:09:46] <mvolz>	 Does anyone know if you need to update the chart version even if the chart doesn't change, to pull through a config.prod.yaml change in the build itself? 
[13:10:22] <mvolz>	 Had a deploy not work and wondering if it's because I only updated the build and not the chart? 
[13:12:36] <wikibugs>	 (03CR) 10JMeybohm: "Sorry if I'm being too picky here. I feel like this increases the complexity of the chart by quite a bit (it has to, I'm aware) and I had " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert)
[13:13:26] <wikibugs>	 (03CR) 10KartikMistry: [C:03+2] Fix LANGUAGE_PAIRS_API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098504 (owner: 10KartikMistry)
[13:13:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job wikidough in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:14:34] <wikibugs>	 (03Merged) 10jenkins-bot: Fix LANGUAGE_PAIRS_API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098504 (owner: 10KartikMistry)
[13:14:38] <wikibugs>	 (03CR) 10Mvolz: [C:03+2] "@akosiaris@wikimedia.org - this change had a change to config.dev.yaml... does the chart need to be incremented to pull this through?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098111 (owner: 10PipelineBot)
[13:15:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7002.wikimedia.org to drbd
[13:15:38] <wikibugs>	 (03CR) 10Mvolz: [C:03+2] "* and config.prod.yaml, the relevant one :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098111 (owner: 10PipelineBot)
[13:15:40] <logmsgbot>	 !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[13:15:48] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1098470 (https://phabricator.wikimedia.org/T378344) (owner: 10Slyngshede)
[13:15:50] <icinga-wm>	 PROBLEM - Host doh7002 is DOWN: PING CRITICAL - Packet loss = 100%
[13:16:00] <icinga-wm>	 RECOVERY - Host doh7002 is UP: PING OK - Packet loss = 0%, RTA = 115.54 ms
[13:16:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum7002.magru.wmnet to drbd
[13:17:04] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on doh7002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:17:05] <icinga-wm>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on doh7002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[13:17:38] <wikibugs>	 (03CR) 10NMW03: "The script automatically did that, it made changes to the array order and the sizes of some logos (which are expected to be done automatic" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (https://phabricator.wikimedia.org/T380974) (owner: 10NMW03)
[13:18:02] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] Configure IRS Zendesk integration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098480 (https://phabricator.wikimedia.org/T380908) (owner: 10Máté Szabó)
[13:18:36] <wikibugs>	 (03PS1) 10Máté Szabó: private: Add stub for wgReportIncidentZendeskSubjectLine [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098506 (https://phabricator.wikimedia.org/T380868)
[13:18:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job wikidough in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:18:49] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] private: Add stub for wgReportIncidentZendeskSubjectLine [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098506 (https://phabricator.wikimedia.org/T380868) (owner: 10Máté Szabó)
[13:19:04] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on doh7002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:19:04] <icinga-wm>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on doh7002 is OK: OK: UP (pid=2379) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[13:20:50] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wdqs1027.eqiad.wmnet with OS bullseye
[13:20:57] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye
[13:20:58] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wdqs1026.eqiad.wmnet with OS bullseye
[13:22:59] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098506 (https://phabricator.wikimedia.org/T380868) (owner: 10Máté Szabó)
[13:23:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098480 (https://phabricator.wikimedia.org/T380908) (owner: 10Máté Szabó)
[13:23:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093389 (https://phabricator.wikimedia.org/T372823) (owner: 10Máté Szabó)
[13:23:41] <wikibugs>	 (03Merged) 10jenkins-bot: private: Add stub for wgReportIncidentZendeskSubjectLine [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098506 (https://phabricator.wikimedia.org/T380868) (owner: 10Máté Szabó)
[13:23:45] <wikibugs>	 (03Merged) 10jenkins-bot: Configure IRS Zendesk integration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098480 (https://phabricator.wikimedia.org/T380908) (owner: 10Máté Szabó)
[13:23:47] <wikibugs>	 (03Merged) 10jenkins-bot: Configure instrument for the Incident Reporting System [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093389 (https://phabricator.wikimedia.org/T372823) (owner: 10Máté Szabó)
[13:24:15] <logmsgbot>	 !log mszabo@deploy2002 Started scap sync-world: Backport for [[gerrit:1098506|private: Add stub for wgReportIncidentZendeskSubjectLine (T380868)]], [[gerrit:1098480|Configure IRS Zendesk integration (T380908)]], [[gerrit:1093389|Configure instrument for the Incident Reporting System (T372823)]]
[13:24:23] <stashbot>	 T380868: Use the Zendesk API for creating tickets for emergency workflow - https://phabricator.wikimedia.org/T380868
[13:24:23] <stashbot>	 T380908: Allow Extension:ReportIncident to make POST requests to wikimediats.zendesk.com - https://phabricator.wikimedia.org/T380908
[13:24:23] <stashbot>	 T372823: Instrumentation for Incident Reporting System - https://phabricator.wikimedia.org/T372823
[13:25:55] <wikibugs>	 (03CR) 10Anzx: [C:04-1] "some of the image widths are odd in unrelated changes other than azwikiquote" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (https://phabricator.wikimedia.org/T380974) (owner: 10NMW03)
[13:26:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7002.magru.wmnet to drbd
[13:26:38] <icinga-wm>	 PROBLEM - Host durum7002 is DOWN: PING CRITICAL - Packet loss = 100%
[13:27:32] <icinga-wm>	 RECOVERY - Host durum7002 is UP: PING OK - Packet loss = 0%, RTA = 115.56 ms
[13:27:57] <moritzm>	 !log rebalance magru02 following switch of VMs back to DRBD T376737
[13:28:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:46] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on durum7002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:28:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to drbd
[13:29:08] <icinga-wm>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum7002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[13:30:06] <logmsgbot>	 !log mszabo@deploy2002 mszabo: Backport for [[gerrit:1098506|private: Add stub for wgReportIncidentZendeskSubjectLine (T380868)]], [[gerrit:1098480|Configure IRS Zendesk integration (T380908)]], [[gerrit:1093389|Configure instrument for the Incident Reporting System (T372823)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:30:08] <icinga-wm>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum7002 is OK: OK: UP (pid=2431) and all threads (8) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[13:30:18] <stashbot>	 T380868: Use the Zendesk API for creating tickets for emergency workflow - https://phabricator.wikimedia.org/T380868
[13:30:18] <stashbot>	 T380908: Allow Extension:ReportIncident to make POST requests to wikimediats.zendesk.com - https://phabricator.wikimedia.org/T380908
[13:30:19] <stashbot>	 T372823: Instrumentation for Incident Reporting System - https://phabricator.wikimedia.org/T372823
[13:30:30] <icinga-wm>	 RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:30:46] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on durum7002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:30:50] <icinga-wm>	 RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:31:19] <logmsgbot>	 !log mszabo@deploy2002 mszabo: Continuing with sync
[13:35:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gnmic in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:37:23] <wikibugs>	 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10361480 (10MoritzMuehlenhoff)
[13:38:08] <logmsgbot>	 !log mszabo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098506|private: Add stub for wgReportIncidentZendeskSubjectLine (T380868)]], [[gerrit:1098480|Configure IRS Zendesk integration (T380908)]], [[gerrit:1093389|Configure instrument for the Incident Reporting System (T372823)]] (duration: 13m 53s)
[13:38:19] <stashbot>	 T380868: Use the Zendesk API for creating tickets for emergency workflow - https://phabricator.wikimedia.org/T380868
[13:38:20] <stashbot>	 T380908: Allow Extension:ReportIncident to make POST requests to wikimediats.zendesk.com - https://phabricator.wikimedia.org/T380908
[13:38:22] <stashbot>	 T372823: Instrumentation for Incident Reporting System - https://phabricator.wikimedia.org/T372823
[13:39:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to drbd
[13:39:48] <icinga-wm>	 PROBLEM - Host netflow7001 is DOWN: PING CRITICAL - Packet loss = 100%
[13:40:42] <jinxer-wm>	 FIRING: [3x] JobUnavailable: Reduced availability for job fastnetmon in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:40:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to drbd
[13:40:48] <icinga-wm>	 RECOVERY - Host netflow7001 is UP: PING OK - Packet loss = 0%, RTA = 115.78 ms
[13:41:30] <wikibugs>	 (03PS7) 10Arnaudb: mariadb: add innodb buffer pool usage monitoring [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589)
[13:41:30] <wikibugs>	 (03CR) 10Arnaudb: "XX has been set to 3600s (1hr) → please let me know if it's not properly adjusted." [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) (owner: 10Arnaudb)
[13:41:52] <wikibugs>	 (03PS4) 10NMW03: Updated wordmark for Azerbaijani Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (https://phabricator.wikimedia.org/T380974)
[13:41:54] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10361494 (10aborrero) the server has been drained and is ready for a reboot when you need it.
[13:42:10] <mszabo>	 backport looks okay
[13:42:54] <wikibugs>	 (03CR) 10Arnaudb: mariadb: add innodb buffer pool usage monitoring (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) (owner: 10Arnaudb)
[13:43:07] <wikibugs>	 (03CR) 10NMW03: "Manually removed that part. Again, that was not me who changed that and I think it is quite normal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (https://phabricator.wikimedia.org/T380974) (owner: 10NMW03)
[13:43:32] <wikibugs>	 (03CR) 10Anzx: [C:03+1] Updated wordmark for Azerbaijani Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (https://phabricator.wikimedia.org/T380974) (owner: 10NMW03)
[13:44:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [software/bitu] - 10https://gerrit.wikimedia.org/r/1097378 (owner: 10Slyngshede)
[13:45:42] <jinxer-wm>	 RESOLVED: [3x] JobUnavailable: Reduced availability for job fastnetmon in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:45:54] <moritzm>	 !log installing php8.2 security updates
[13:45:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:12] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job fastnetmon in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:50:57] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job fastnetmon in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:53:25] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1097336 (owner: 10Slyngshede)
[13:54:39] <Nemoralis>	 jouncebot: next
[13:54:39] <jouncebot>	 In 0 hour(s) and 5 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T1400)
[13:55:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to drbd
[13:56:14] <icinga-wm>	 PROBLEM - Host install7001 is DOWN: PING CRITICAL - Packet loss = 100%
[13:56:24] <icinga-wm>	 RECOVERY - Host install7001 is UP: PING OK - Packet loss = 0%, RTA = 115.63 ms
[13:57:10] <jinxer-wm>	 FIRING: [14x] ProbeDown: Service install7001:8080 has failed probes (http_squid_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:57:19] <jinxer-wm>	 RESOLVED: [4x] JobUnavailable: Reduced availability for job fastnetmon in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:59:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to drbd
[13:59:11] <jinxer-wm>	 FIRING: [14x] ProbeDown: Service install7001:8080 has failed probes (http_squid_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:00:04] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T1400).
[14:00:04] <jouncebot>	 cscott and Nemoralis: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:14] <Nemoralis>	 o/
[14:00:14] <urbanecm>	 i can deploy today
[14:00:16] <Lucas_WMDE>	 o/
[14:00:20] <cscott>	 o/
[14:00:36] <Nemoralis>	 hi martin o/ thanks
[14:00:50] <wikibugs>	 (03PS2) 10C. Scott Ananian: Enable ParserMigration compact indicator on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098076 (https://phabricator.wikimedia.org/T363484)
[14:00:52] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] Enable ParserMigration compact indicator on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098076 (https://phabricator.wikimedia.org/T363484) (owner: 10C. Scott Ananian)
[14:01:13] <wikibugs>	 (03PS3) 10C. Scott Ananian: Deploy Parsoid Read Views to de/ru wikivoyage and dagwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093405 (https://phabricator.wikimedia.org/T375394)
[14:01:16] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] Deploy Parsoid Read Views to de/ru wikivoyage and dagwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093405 (https://phabricator.wikimedia.org/T375394) (owner: 10C. Scott Ananian)
[14:01:25] <wikibugs>	 (03PS1) 10Jgreen: Records for analytics*.frdev for consistency and new service. [dns] - 10https://gerrit.wikimedia.org/r/1098508 (https://phabricator.wikimedia.org/T377363)
[14:01:42] <wikibugs>	 (03PS5) 10NMW03: Updated wordmark for Azerbaijani Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (https://phabricator.wikimedia.org/T380974)
[14:01:45] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] Updated wordmark for Azerbaijani Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (https://phabricator.wikimedia.org/T380974) (owner: 10NMW03)
[14:02:02] <wikibugs>	 (03Merged) 10jenkins-bot: Enable ParserMigration compact indicator on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098076 (https://phabricator.wikimedia.org/T363484) (owner: 10C. Scott Ananian)
[14:02:07] <wikibugs>	 (03Merged) 10jenkins-bot: Deploy Parsoid Read Views to de/ru wikivoyage and dagwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093405 (https://phabricator.wikimedia.org/T375394) (owner: 10C. Scott Ananian)
[14:02:13] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (https://phabricator.wikimedia.org/T380974) (owner: 10NMW03)
[14:02:32] <wikibugs>	 (03Merged) 10jenkins-bot: Updated wordmark for Azerbaijani Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (https://phabricator.wikimedia.org/T380974) (owner: 10NMW03)
[14:03:01] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1098076|Enable ParserMigration compact indicator on all wikis (T363484)]], [[gerrit:1093405|Deploy Parsoid Read Views to de/ru wikivoyage and dagwiki (T375394 T380401)]], [[gerrit:1098019|Updated wordmark for Azerbaijani Wikiquote (T380974)]]
[14:03:10] <stashbot>	 T363484: Update ParserMigration notice - https://phabricator.wikimedia.org/T363484
[14:03:10] <stashbot>	 T375394: Deploy Parsoid Read Views to de/ru wikivoyage (week of 2024-11-25) - https://phabricator.wikimedia.org/T375394
[14:03:10] <stashbot>	 T380401: Deploy Parsoid Read Views to dagwiki (week of 2024-11-25) - https://phabricator.wikimedia.org/T380401
[14:03:11] <stashbot>	 T380974: Update azwikiquote wordmark - https://phabricator.wikimedia.org/T380974
[14:03:29] <wikibugs>	 (03PS3) 10Urbanecm: [GrowthExperiments] Undefine wgGEDatabaseCluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097309 (https://phabricator.wikimedia.org/T354939)
[14:03:33] <wikibugs>	 (03CR) 10Urbanecm: [GrowthExperiments] Undefine wgGEDatabaseCluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097309 (https://phabricator.wikimedia.org/T354939) (owner: 10Urbanecm)
[14:04:03] <jinxer-wm>	 FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[14:04:12] <wikibugs>	 (03PS1) 10Slyngshede: Migrate UI customizations to a theme [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1098510 (https://phabricator.wikimedia.org/T380172)
[14:04:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good, one comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/1097935 (https://phabricator.wikimedia.org/T380402) (owner: 10Slyngshede)
[14:05:12] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job benthos in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:05:34] <wikibugs>	 (03PS2) 10Abijeet Patro: Enable message group subscription feature for some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098509 (https://phabricator.wikimedia.org/T372386)
[14:06:29] <wikibugs>	 (03PS1) 10Elukey: modules: add mesh.configuration 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098511
[14:06:29] <wikibugs>	 (03PS1) 10Elukey: modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647)
[14:08:49] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm, cscott, nmw03: Backport for [[gerrit:1098076|Enable ParserMigration compact indicator on all wikis (T363484)]], [[gerrit:1093405|Deploy Parsoid Read Views to de/ru wikivoyage and dagwiki (T375394 T380401)]], [[gerrit:1098019|Updated wordmark for Azerbaijani Wikiquote (T380974)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:08:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to drbd
[14:08:54] <icinga-wm>	 PROBLEM - Host ncredir7001 is DOWN: PING CRITICAL - Packet loss = 100%
[14:08:57] <stashbot>	 T363484: Update ParserMigration notice - https://phabricator.wikimedia.org/T363484
[14:08:57] <stashbot>	 T375394: Deploy Parsoid Read Views to de/ru wikivoyage (week of 2024-11-25) - https://phabricator.wikimedia.org/T375394
[14:08:57] <stashbot>	 T380401: Deploy Parsoid Read Views to dagwiki (week of 2024-11-25) - https://phabricator.wikimedia.org/T380401
[14:08:58] <stashbot>	 T380974: Update azwikiquote wordmark - https://phabricator.wikimedia.org/T380974
[14:09:03] <urbanecm>	 cscott: Nemoralis: can you test your patches, please?
[14:09:09] <cscott>	 oh it
[14:09:14] <icinga-wm>	 RECOVERY - Host ncredir7001 is UP: PING OK - Packet loss = 0%, RTA = 115.68 ms
[14:09:26] <jinxer-wm>	 FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:09:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to drbd
[14:10:12] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job benthos in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:10:16] <Nemoralis>	 urbanecm: LGTM
[14:10:23] <urbanecm>	 ty
[14:11:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10361606 (10Jclark-ctr) Dell rejected parts request opening new ticket with them 201666996
[14:12:14] <wikibugs>	 (03PS2) 10Muehlenhoff: debmonitor: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1093350
[14:12:28] <icinga-wm>	 PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:12:30] <icinga-wm>	 PROBLEM - BFD status on asw1-b3-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:13:32] <cscott>	 urbanecm: both of my patches look good to me on canaries
[14:13:41] <urbanecm>	 great! proceeding
[14:13:44] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm, cscott, nmw03: Continuing with sync
[14:14:03] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] [GrowthExperiments] Undefine wgGEDatabaseCluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097309 (https://phabricator.wikimedia.org/T354939) (owner: 10Urbanecm)
[14:14:47] <wikibugs>	 (03Merged) 10jenkins-bot: [GrowthExperiments] Undefine wgGEDatabaseCluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097309 (https://phabricator.wikimedia.org/T354939) (owner: 10Urbanecm)
[14:15:54] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-11-19-140330 to 2024-11-27-074306 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098518 (https://phabricator.wikimedia.org/T139010)
[14:16:03] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-11-19-132736 to 2024-11-26-193226 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098519 (https://phabricator.wikimedia.org/T139010)
[14:17:43] <wikibugs>	 (03CR) 10Jelto: "I left some comments in-line. Keep in mind to also bump the `version` in `Charts.yaml`." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth)
[14:19:11] <wikibugs>	 (03PS2) 10Muehlenhoff: Add umbrella Cumin alias for wikikube staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/1092776
[14:19:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to drbd
[14:20:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh7001.wikimedia.org to drbd
[14:20:06] <icinga-wm>	 PROBLEM - Host durum7001 is DOWN: PING CRITICAL - Packet loss = 100%
[14:20:22] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098076|Enable ParserMigration compact indicator on all wikis (T363484)]], [[gerrit:1093405|Deploy Parsoid Read Views to de/ru wikivoyage and dagwiki (T375394 T380401)]], [[gerrit:1098019|Updated wordmark for Azerbaijani Wikiquote (T380974)]] (duration: 17m 20s)
[14:20:28] <icinga-wm>	 RECOVERY - Host durum7001 is UP: PING OK - Packet loss = 0%, RTA = 115.64 ms
[14:20:29] <stashbot>	 T363484: Update ParserMigration notice - https://phabricator.wikimedia.org/T363484
[14:20:29] <stashbot>	 T375394: Deploy Parsoid Read Views to de/ru wikivoyage (week of 2024-11-25) - https://phabricator.wikimedia.org/T375394
[14:20:30] <stashbot>	 T380401: Deploy Parsoid Read Views to dagwiki (week of 2024-11-25) - https://phabricator.wikimedia.org/T380401
[14:20:30] <stashbot>	 T380974: Update azwikiquote wordmark - https://phabricator.wikimedia.org/T380974
[14:20:30] <urbanecm>	 Nemoralis: cscott: pushed to prod!
[14:20:36] <urbanecm>	 anything else?
[14:20:38] <cscott>	 urbanecm: thank you!
[14:20:42] <urbanecm>	 any time
[14:20:42] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] Add umbrella Cumin alias for wikikube staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/1092776 (owner: 10Muehlenhoff)
[14:20:59] <logmsgbot>	 !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1097309|[GrowthExperiments] Undefine wgGEDatabaseCluster (T354939)]]
[14:21:04] <stashbot>	 T354939: Migrate GrowthExperiments to virtual domains - https://phabricator.wikimedia.org/T354939
[14:21:18] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wdqs1026.eqiad.wmnet with OS bullseye
[14:21:29] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10361675 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wdqs1026.eqiad.wmnet with OS bullseye
[14:21:31] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wdqs1027.eqiad.wmnet with OS bullseye
[14:21:41] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10361676 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wdqs1027.eqiad.wmnet with OS bullseye
[14:22:06] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on durum7001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[14:22:06] <icinga-wm>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum7001 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[14:22:42] <Nemoralis>	 urbanecm: thanks
[14:23:49] <Nemoralis>	 urbanecm: is there any cache for the logos?
[14:24:00] <urbanecm>	 thanks for the reminder, let me purge
[14:24:06] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on durum7001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[14:24:06] <icinga-wm>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum7001 is OK: OK: UP (pid=2388) and all threads (8) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[14:24:09] <Nemoralis>	 it works fine in test server, but prod is still old
[14:24:42] <urbanecm>	 !log Purge https://en.wikipedia.org/static/images/mobile/copyright/wikiquote-wordmark-az.svg (T380974)
[14:24:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:47] <urbanecm>	 Nemoralis: what about now?
[14:25:00] <logmsgbot>	 !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on cloudvirt1061.eqiad.wmnet with reason: cloudvirt1061 needs maintenance T380673
[14:25:04] <stashbot>	 T380673: Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673
[14:25:09] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10361683 (10Jclark-ctr) Finished with bios update waiting on dell  for response  for new ticket
[14:25:13] <logmsgbot>	 !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on cloudvirt1061.eqiad.wmnet with reason: cloudvirt1061 needs maintenance T380673
[14:25:16] <Nemoralis>	 urbanecm: nice, thanks!
[14:26:27] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1097309|[GrowthExperiments] Undefine wgGEDatabaseCluster (T354939)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:26:32] <stashbot>	 T354939: Migrate GrowthExperiments to virtual domains - https://phabricator.wikimedia.org/T354939
[14:26:35] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Continuing with sync
[14:27:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job wikidough in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:29:02] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10361703 (10fnegri) 05Open→03In progress
[14:30:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to drbd
[14:31:48] <icinga-wm>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on doh7001 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[14:32:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job wikidough in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:33:04] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on doh7001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[14:33:13] <sukhe>	 ^ downtiming and will check
[14:33:21] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1097309|[GrowthExperiments] Undefine wgGEDatabaseCluster (T354939)]] (duration: 12m 21s)
[14:33:22] <sukhe>	 probably related to the work moritzm is doing
[14:33:25] <stashbot>	 T354939: Migrate GrowthExperiments to virtual domains - https://phabricator.wikimedia.org/T354939
[14:33:33] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on doh[7001-7002].wikimedia.org with reason: site is depooled, maintenance
[14:33:48] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on doh[7001-7002].wikimedia.org with reason: site is depooled, maintenance
[14:34:04] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on doh7001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[14:34:28] <icinga-wm>	 RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:34:30] <icinga-wm>	 RECOVERY - BFD status on asw1-b3-magru.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:34:48] <icinga-wm>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on doh7001 is OK: OK: UP (pid=2388) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[14:35:28] <moritzm>	 sukhe: yeah, that was the switch of the VM back to DRBD
[14:35:40] <moritzm>	 !log rebalance magru01 following switch of VMs back to DRBD T376737
[14:35:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:47] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[14:39:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1022.eqiad.wmnet
[14:39:45] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10361778 (10ops-monitoring-bot) Draining ganeti1022.eqiad.wmnet of running VMs
[14:41:24] <wikibugs>	 (03PS1) 10KartikMistry: Update recommendation-api to 2024-11-27-142924-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098525
[14:43:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1022.eqiad.wmnet
[14:44:08] <wikibugs>	 (03CR) 10KartikMistry: [C:03+2] Update recommendation-api to 2024-11-27-142924-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098525 (owner: 10KartikMistry)
[14:45:10] <wikibugs>	 (03Merged) 10jenkins-bot: Update recommendation-api to 2024-11-27-142924-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098525 (owner: 10KartikMistry)
[14:47:19] <wikibugs>	 (03PS3) 10Elukey: admin: add Jimmy Ly's account [puppet] - 10https://gerrit.wikimedia.org/r/1098024 (https://phabricator.wikimedia.org/T380525)
[14:48:15] <logmsgbot>	 !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[14:48:28] <kart_>	 Deploying rec-api ^^
[14:49:08] <wikibugs>	 (03CR) 10Elukey: [C:04-1] "Still missing the approval for the deployment group, waiting for Tyler's +1 in the task." [puppet] - 10https://gerrit.wikimedia.org/r/1098024 (https://phabricator.wikimedia.org/T380525) (owner: 10Elukey)
[14:51:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd1003.eqiad.wmnet to drbd
[14:51:54] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment & stats private data access for jly - https://phabricator.wikimedia.org/T380525#10361832 (10elukey) >>! In T380525#10357542, @Jly wrote: > @elukey Got it, I have updated the key now, please see  All good thanks! I updated the c...
[14:52:08] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment & stats private data access for jly - https://phabricator.wikimedia.org/T380525#10361834 (10elukey)
[14:52:13] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10361838 (10ops-monitoring-bot) VM ml-etcd1003.eqiad.wmnet switching disk type to drbd
[14:52:30] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Sspalding - https://phabricator.wikimedia.org/T380820#10361835 (10elukey) 05Open→03Resolved a:03elukey Closing for the moment, please re-open if needed!
[14:54:48] <wikibugs>	 (03CR) 10Nikerabbit: [C:03+1] Enable message group subscription feature for some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098509 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro)
[14:56:35] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for JLy-WMF - https://phabricator.wikimedia.org/T380523#10361858 (10elukey) 05Open→03Resolved a:03elukey ` elukey@mwmaint1002:~$ sudo ldapsearch -x cn=wmf | grep jly member: uid=jly,ou=people,dc=wikimedia,dc=org `  Added!
[14:58:10] <logmsgbot>	 !log kartik@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[14:59:19] <logmsgbot>	 !log kartik@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[15:00:48] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T1500)
[15:01:04] <wikibugs>	 (03CR) 10Vgutierrez: "canonical domains are curated in hiera under the key `wikimedia_domains` defined on hieradata/common.yaml, any chance of reusing it?" [puppet] - 10https://gerrit.wikimedia.org/r/1092359 (https://phabricator.wikimedia.org/T374640) (owner: 10BCornwall)
[15:01:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-etcd1003.eqiad.wmnet to drbd
[15:01:08] <icinga-wm>	 PROBLEM - Host ml-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100%
[15:01:28] <icinga-wm>	 RECOVERY - Host ml-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms
[15:02:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1022.eqiad.wmnet
[15:02:43] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10361888 (10ops-monitoring-bot) Draining ganeti1022.eqiad.wmnet of running VMs
[15:02:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1022.eqiad.wmnet
[15:03:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd1003.eqiad.wmnet to plain
[15:03:55] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10361893 (10ops-monitoring-bot) VM ml-etcd1003.eqiad.wmnet switching disk type to plain
[15:04:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-etcd1003.eqiad.wmnet to plain
[15:05:23] <kart_>	 !log Updated recommendation-api to 2024-11-27-142924-production (T380838, T379036, T380699)
[15:05:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:29] <wikibugs>	 (03PS8) 10Clément Goubert: mediawiki: Add mwcron feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555)
[15:05:30] <stashbot>	 T380838: recommendation API server fails to fill cache  - https://phabricator.wikimedia.org/T380838
[15:05:30] <stashbot>	 T379036: Update cache in a single thread - https://phabricator.wikimedia.org/T379036
[15:05:31] <stashbot>	 T380699: recommendation-api /api/v1/translation/page-collections throws 500 when cache is empty - https://phabricator.wikimedia.org/T380699
[15:05:42] <wikibugs>	 (CR) Clément Goubert: mediawiki: Add mwcron feature (7 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: Clément Goubert)
[15:05:43] <wikibugs>	 (03PS2) 10Elukey: modules: add mesh.configuration 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098511
[15:05:43] <wikibugs>	 (03PS2) 10Elukey: modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647)
[15:05:43] <wikibugs>	 (03PS1) 10Elukey: charts: update tegola-vector-tiles to mesh.configuration:1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098530
[15:05:43] <wikibugs>	 (03PS1) 10Elukey: services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098531
[15:05:52] <wikibugs>	 (03CR) 10Ecarg: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-11-19-140330 to 2024-11-27-074306 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098518 (https://phabricator.wikimedia.org/T139010) (owner: 10Jforrester)
[15:06:56] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-11-19-140330 to 2024-11-27-074306 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098518 (https://phabricator.wikimedia.org/T139010) (owner: 10Jforrester)
[15:08:15] <Krinkle>	 !log krinkle@webperf2003: `sudo apt-get install kafkacat` (matching webperf1003, for ad-hoc debugging)
[15:08:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:47] <logmsgbot>	 !log ecarg@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:09:39] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to NDA-users for ncreasy - https://phabricator.wikimedia.org/T380097#10361923 (10elukey) 05Open→03Resolved a:03elukey @NCreasy the wmf group should be enough, you are free to play with DataHub, all perms should be set. If you find any issue please re-open t...
[15:09:58] <logmsgbot>	 !log ecarg@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:13:03] <wikibugs>	 (03PS19) 10Hnowlan: mediawiki: add mercurius features [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701)
[15:13:57] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T380487#10361941 (10elukey) Hi! I am looping in @KFrancis since afaics we need to sign an NDA before proceeding.  @KFrancis could you please take a look? Thanks in advance :)
[15:15:21] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow-wmde: stop managing the airflow instance via puppet [puppet] - 10https://gerrit.wikimedia.org/r/1097308 (https://phabricator.wikimedia.org/T380622) (owner: 10Brouberol)
[15:15:39] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-wmde: stop managing the airflow instance via puppet [puppet] - 10https://gerrit.wikimedia.org/r/1097308 (https://phabricator.wikimedia.org/T380622) (owner: 10Brouberol)
[15:16:05] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] puppetboard: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1093873 (owner: 10Muehlenhoff)
[15:20:21] <logmsgbot>	 !log ecarg@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:20:56] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Suzanne Wood (WMDE) - https://phabricator.wikimedia.org/T380994 (10SuzanneWood-WMDE) 03NEW
[15:21:10] <logmsgbot>	 !log ecarg@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:21:31] <wikibugs>	 (03PS2) 10Elukey: services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098531
[15:22:02] <logmsgbot>	 !log ecarg@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:22:54] <logmsgbot>	 !log ecarg@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:25:49] <wikibugs>	 (03CR) 10Ecarg: [C:03+2] wikifunctions: Upgrade evaluators from 2024-11-19-132736 to 2024-11-26-193226 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098519 (https://phabricator.wikimedia.org/T139010) (owner: 10Jforrester)
[15:26:58] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2024-11-19-132736 to 2024-11-26-193226 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098519 (https://phabricator.wikimedia.org/T139010) (owner: 10Jforrester)
[15:27:36] <logmsgbot>	 !log ecarg@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:28:33] <logmsgbot>	 !log ecarg@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:28:45] <wikibugs>	 (03PS1) 10Elukey: admin: add sspalding to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1098545 (https://phabricator.wikimedia.org/T380820)
[15:28:54] <icinga-wm>	 PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7010 is CRITICAL: connect to address 10.140.1.2 and port 9122: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[15:28:54] <icinga-wm>	 PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp7010 is CRITICAL: connect to address 10.140.1.2 and port 3128: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[15:28:56] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp7010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[15:28:59] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp7010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[15:29:24] <icinga-wm>	 PROBLEM - SSH on cp7010 is CRITICAL: connect to address 10.140.1.2 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:30:03] <logmsgbot>	 !log ecarg@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:30:12] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Access to Data Hub - IAckerman-WMF - https://phabricator.wikimedia.org/T380091#10362054 (10elukey) 05Open→03Resolved a:03elukey ` elukey@mwmaint1002:~$ sudo ldapsearch -x cn=wmf | grep iacke member: uid=iackerman,ou=people,dc=wikimedia,dc=org `  Added! I don't think tha...
[15:30:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add umbrella Cumin alias for wikikube staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/1092776 (owner: 10Muehlenhoff)
[15:30:52] <logmsgbot>	 !log ecarg@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:31:04] <logmsgbot>	 !log ecarg@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:31:26] <icinga-wm>	 RECOVERY - SSH on cp7010 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:31:54] <icinga-wm>	 RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7010 is OK: HTTP OK: HTTP/1.0 200 OK - 36187 bytes in 0.407 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[15:31:54] <icinga-wm>	 RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp7010 is OK: HTTP OK: HTTP/1.1 200 OK - 48376 bytes in 0.465 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[15:31:56] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp7010 is OK: SSL OK - OCSP staple validity for wikipedia.org has 549067 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-10-17 23:59:59 +0000 (expires in 324 days) https://wikitech.wikimedia.org/wiki/HTTPS
[15:31:58] <icinga-wm>	 RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp7010 is OK: SSL OK - OCSP staple validity for wikipedia.org has 549065 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-10-17 23:59:59 +0000 (expires in 324 days) https://wikitech.wikimedia.org/wiki/HTTPS
[15:32:14] <logmsgbot>	 !log ecarg@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:32:55] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1098545 (https://phabricator.wikimedia.org/T380820) (owner: 10Elukey)
[15:32:56] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[15:33:10] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[15:33:12] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin: add sspalding to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1098545 (https://phabricator.wikimedia.org/T380820) (owner: 10Elukey)
[15:33:17] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T370903)', diff saved to https://phabricator.wikimedia.org/P71215 and previous config saved to /var/cache/conftool/dbconfig/20241127-153316-ladsgroup.json
[15:33:22] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[15:36:02] <wikibugs>	 (03PS3) 10Elukey: services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098531
[15:37:00] <wikibugs>	 (03PS3) 10Elukey: modules: add mesh.configuration 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098511 (https://phabricator.wikimedia.org/T322647)
[15:37:02] <wikibugs>	 (03PS3) 10Elukey: modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647)
[15:37:02] <wikibugs>	 (03PS2) 10Elukey: charts: update tegola-vector-tiles to mesh.configuration:1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098530 (https://phabricator.wikimedia.org/T322647)
[15:37:03] <wikibugs>	 (03PS4) 10Elukey: services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647)
[15:39:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[15:41:32] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1026.eqiad.wmnet with OS bullseye
[15:41:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10362156 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wdqs1026.eqiad.wmnet with OS bullseye executed with errors...
[15:41:44] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1027.eqiad.wmnet with OS bullseye
[15:41:56] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10362157 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wdqs1027.eqiad.wmnet with OS bullseye executed with errors...
[15:42:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Make docker::baseimages ensurable [puppet] - 10https://gerrit.wikimedia.org/r/1094393 (https://phabricator.wikimedia.org/T379343) (owner: 10Muehlenhoff)
[15:48:24] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T370903)', diff saved to https://phabricator.wikimedia.org/P71216 and previous config saved to /var/cache/conftool/dbconfig/20241127-154823-ladsgroup.json
[15:48:28] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[15:49:56] <wikibugs>	 (03PS1) 10Effie Mouzeli: kafka-main: Replace kafka-main1002 with kafka-main1007 [puppet] - 10https://gerrit.wikimedia.org/r/1098548 (https://phabricator.wikimedia.org/T363214)
[15:51:07] <wikibugs>	 (03PS2) 10Brouberol: airflow: add kerberos-related environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098094 (https://phabricator.wikimedia.org/T380765)
[15:51:11] <wikibugs>	 (CR) C. Scott Ananian: Deploy Parsoid Read Views to de/ru wikivoyage and dagwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1093405 (https://phabricator.wikimedia.org/T375394) (owner: C. Scott Ananian)
[15:51:27] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] kafka-main: Replace kafka-main1002 with kafka-main1007 [puppet] - 10https://gerrit.wikimedia.org/r/1098548 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli)
[15:51:34] <wikibugs>	 (03PS3) 10Brouberol: airflow: add kerberos-related environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098094 (https://phabricator.wikimedia.org/T380765)
[15:52:12] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow: add kerberos-related environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098094 (https://phabricator.wikimedia.org/T380765) (owner: 10Brouberol)
[15:55:11] <wikibugs>	 (03PS6) 10Majavah: dynamicproxy: Listen on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1091802 (https://phabricator.wikimedia.org/T379175)
[15:55:45] <wikibugs>	 (CR) David Caro: Example of QoS rules for cloudcephosd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1058612 (https://phabricator.wikimedia.org/T371501) (owner: Cathal Mooney)
[15:56:04] <wikibugs>	 (03CR) 10Majavah: [C:03+2] dynamicproxy: Listen on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1091802 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah)
[15:58:06] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] "🎉" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan)
[15:59:17] <wikibugs>	 (03PS1) 10C. Scott Ananian: Allow defaulting to Parsoid Read Views when MobileFrontEnd is active [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098549
[15:59:19] <wikibugs>	 (CR) Majavah: mediawiki: add mercurius features (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) (owner: Hnowlan)
[15:59:34] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: add kerberos-related environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098094 (https://phabricator.wikimedia.org/T380765) (owner: 10Brouberol)
[15:59:43] <wikibugs>	 (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1098548 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli)
[16:00:13] <wikibugs>	 (03CR) 10C. Scott Ananian: [C:04-2] "On hold pending train deploy of the Depends-On patch and the December deployment freeze." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098549 (owner: 10C. Scott Ananian)
[16:02:28] <wikibugs>	 (03PS3) 10Majavah: dynamicproxy: Run Redis update in app context [puppet] - 10https://gerrit.wikimedia.org/r/1091848 (https://phabricator.wikimedia.org/T379175)
[16:02:28] <wikibugs>	 (03PS11) 10Majavah: dynamicproxy: Canocalize IP addresses before comparing [puppet] - 10https://gerrit.wikimedia.org/r/1088339 (https://phabricator.wikimedia.org/T379175)
[16:02:28] <wikibugs>	 (03PS10) 10Majavah: dynamicproxy: Provision AAAA records [puppet] - 10https://gerrit.wikimedia.org/r/1088338 (https://phabricator.wikimedia.org/T379175)
[16:03:31] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P71217 and previous config saved to /var/cache/conftool/dbconfig/20241127-160330-ladsgroup.json
[16:04:34] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] kafka-main: Replace kafka-main1002 with kafka-main1007 [puppet] - https://gerrit.wikimedia.org/r/1098548 (https://phabricator.wikimedia.org/T363214) (owner: Effie Mouzeli)
[16:04:36] <wikibugs>	 (CR) Majavah: [C:+2] dynamicproxy: Run Redis update in app context [puppet] - https://gerrit.wikimedia.org/r/1091848 (https://phabricator.wikimedia.org/T379175) (owner: Majavah)
[16:05:24] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.dns.netbox
[16:11:03] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp70101 - fabfur@cumin1002"
[16:11:07] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp70101 - fabfur@cumin1002"
[16:11:08] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:11:28] <moritzm>	 !log installing distro-info-data updates from bookworm point release
[16:11:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:48] <effie>	 !log roll restarting kafka-main brokers - T363214
[16:12:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:52] <stashbot>	 T363214: kafka-main100[6789] and kafka-main1010 implementation tracking - https://phabricator.wikimedia.org/T363214
[16:13:20] <wikibugs>	 (PS1) Fabfur: hiera: fix magru ip addresses during migration [puppet] - https://gerrit.wikimedia.org/r/1098554 (https://phabricator.wikimedia.org/T380307)
[16:15:21] <wikibugs>	 (CR) Ssingh: hiera: fix magru ip addresses during migration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1098554 (https://phabricator.wikimedia.org/T380307) (owner: Fabfur)
[16:15:38] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:16:22] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10362308 (MoritzMuehlenhoff)
[16:16:25] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-main-eqiad
[16:16:28] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:16:51] <wikibugs>	 (CR) Majavah: [C:+2] dynamicproxy: Canocalize IP addresses before comparing [puppet] - https://gerrit.wikimedia.org/r/1088339 (https://phabricator.wikimedia.org/T379175) (owner: Majavah)
[16:17:18] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52922 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:17:28] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:18:38] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P71218 and previous config saved to /var/cache/conftool/dbconfig/20241127-161837-ladsgroup.json
[16:18:42] <wikibugs>	 (CR) Elukey: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1098545 (https://phabricator.wikimedia.org/T380820) (owner: Elukey)
[16:19:03] <wikibugs>	 (CR) Elukey: [C:+2] profile::service_proxy::envoy: add tegola [puppet] - https://gerrit.wikimedia.org/r/1097333 (https://phabricator.wikimedia.org/T378944) (owner: Elukey)
[16:19:05] <wikibugs>	 (Abandoned) Muehlenhoff: Cover one more case in the setup of Envoy firewall rules [puppet] - https://gerrit.wikimedia.org/r/1082806 (owner: Muehlenhoff)
[16:19:22] <wikibugs>	 ops-magru, DC-Ops, Infrastructure-Foundations, Traffic, Patch-For-Review: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10362331 (Fabfur)
[16:22:32] <wikibugs>	 (PS2) C. Scott Ananian: Allow defaulting to Parsoid Read Views when MobileFrontEnd is active [mediawiki-config] - https://gerrit.wikimedia.org/r/1098549 (https://phabricator.wikimedia.org/T381002)
[16:23:48] <wikibugs>	 (PS2) Fabfur: hiera: fix magru dns7001 ip address during migration [puppet] - https://gerrit.wikimedia.org/r/1098554 (https://phabricator.wikimedia.org/T380307)
[16:24:00] <wikibugs>	 (CR) Ssingh: [C:+1] hiera: fix magru dns7001 ip address during migration [puppet] - https://gerrit.wikimedia.org/r/1098554 (https://phabricator.wikimedia.org/T380307) (owner: Fabfur)
[16:24:17] <wikibugs>	 (CR) Fabfur: hiera: fix magru dns7001 ip address during migration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1098554 (https://phabricator.wikimedia.org/T380307) (owner: Fabfur)
[16:24:57] <wikibugs>	 (CR) Fabfur: [C:+2] hiera: fix magru dns7001 ip address during migration [puppet] - https://gerrit.wikimedia.org/r/1098554 (https://phabricator.wikimedia.org/T380307) (owner: Fabfur)
[16:25:15] <wikibugs>	 (PS1) Muehlenhoff: cloudweb: Restrict access to Envoy port [puppet] - https://gerrit.wikimedia.org/r/1098556
[16:26:13] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp7010.magru.wmnet with OS bullseye
[16:26:26] <wikibugs>	 ops-magru, DC-Ops, Infrastructure-Foundations, Traffic, Patch-For-Review: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10362368 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp7010...
[16:26:37] <wikibugs>	 (CR) Majavah: cloudweb: Restrict access to Envoy port (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1098556 (owner: Muehlenhoff)
[16:27:05] <wikibugs>	 (PS2) Muehlenhoff: cloudweb: Restrict access to Envoy port [puppet] - https://gerrit.wikimedia.org/r/1098556
[16:27:45] <wikibugs>	 (CR) Dzahn: [C:+1] mailman: run tasks every 24 hours [puppet] - https://gerrit.wikimedia.org/r/1098489 (https://phabricator.wikimedia.org/T377045) (owner: AOkoth)
[16:27:59] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-eqiad
[16:28:50] <wikibugs>	 (CR) Muehlenhoff: cloudweb: Restrict access to Envoy port (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1098556 (owner: Muehlenhoff)
[16:28:58] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1098556 (owner: Muehlenhoff)
[16:33:44] <wikibugs>	 (CR) Majavah: cloudweb: Restrict access to Envoy port (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1098556 (owner: Muehlenhoff)
[16:33:45] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T370903)', diff saved to https://phabricator.wikimedia.org/P71220 and previous config saved to /var/cache/conftool/dbconfig/20241127-163344-ladsgroup.json
[16:33:47] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1192.eqiad.wmnet with reason: Maintenance
[16:33:51] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[16:34:00] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1192.eqiad.wmnet with reason: Maintenance
[16:34:07] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T370903)', diff saved to https://phabricator.wikimedia.org/P71221 and previous config saved to /var/cache/conftool/dbconfig/20241127-163407-ladsgroup.json
[16:35:07] <wikibugs>	 Puppet, cloud-services-team, Cloud-VPS, Infrastructure-Foundations: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10362431 (fnegri)
[16:36:08] <wikibugs>	 (CR) JMeybohm: [C:+2] k8s.reboot-nodes: Limit allowed aliases to those of the k8s cluster [cookbooks] - https://gerrit.wikimedia.org/r/1098091 (owner: JMeybohm)
[16:36:26] <wikibugs>	 (PS1) Muehlenhoff: Assign builder role to build2002 (WIP) [puppet] - https://gerrit.wikimedia.org/r/1098558 (https://phabricator.wikimedia.org/T379343)
[16:39:02] <wikibugs>	 (PS4) Elukey: modules: add mesh.configuration 1.10.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1098511 (https://phabricator.wikimedia.org/T322647)
[16:39:02] <wikibugs>	 (PS4) Elukey: modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647)
[16:39:03] <wikibugs>	 (PS3) Elukey: charts: update tegola-vector-tiles to mesh.configuration:1.10.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1098530 (https://phabricator.wikimedia.org/T322647)
[16:39:03] <jinxer-wm>	 RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[16:39:03] <wikibugs>	 (PS5) Elukey: services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647)
[16:40:50] <wikibugs>	 (CR) Bking: [C:+2] wdqs-ldf: Make Data Platform SRE the recipient of the LDF alerts [puppet] - https://gerrit.wikimedia.org/r/1097441 (https://phabricator.wikimedia.org/T379182) (owner: Bking)
[16:40:54] <wikibugs>	 (CR) Dwisehaupt: [C:+2] Records for analytics*.frdev for consistency and new service. [dns] - https://gerrit.wikimedia.org/r/1098508 (https://phabricator.wikimedia.org/T377363) (owner: Jgreen)
[16:41:18] <wikibugs>	 (PS2) Muehlenhoff: Assign builder role to build2002 (WIP) [puppet] - https://gerrit.wikimedia.org/r/1098558 (https://phabricator.wikimedia.org/T379343)
[16:41:50] <wikibugs>	 (PS1) Effie Mouzeli: Update various kafka-main connection strings for kafka-main1007 [deployment-charts] - https://gerrit.wikimedia.org/r/1098559 (https://phabricator.wikimedia.org/T363214)
[16:41:52] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1098558 (https://phabricator.wikimedia.org/T379343) (owner: Muehlenhoff)
[16:42:19] <wikibugs>	 (Merged) jenkins-bot: k8s.reboot-nodes: Limit allowed aliases to those of the k8s cluster [cookbooks] - https://gerrit.wikimedia.org/r/1098091 (owner: JMeybohm)
[16:43:07] <wikibugs>	 (PS5) Elukey: modules: add mesh.configuration 1.10.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1098511 (https://phabricator.wikimedia.org/T322647)
[16:43:07] <wikibugs>	 (PS5) Elukey: modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647)
[16:43:07] <wikibugs>	 (PS4) Elukey: charts: update tegola-vector-tiles to mesh.configuration:1.10.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1098530 (https://phabricator.wikimedia.org/T322647)
[16:43:07] <wikibugs>	 (PS6) Elukey: services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647)
[16:45:18] <wikibugs>	 (PS6) Elukey: modules: add mesh.configuration 1.10.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1098511 (https://phabricator.wikimedia.org/T322647)
[16:45:18] <wikibugs>	 (PS6) Elukey: modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647)
[16:45:18] <wikibugs>	 (PS5) Elukey: charts: update tegola-vector-tiles to mesh.configuration:1.10.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1098530 (https://phabricator.wikimedia.org/T322647)
[16:45:18] <wikibugs>	 (PS7) Elukey: services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647)
[16:47:14] <wikibugs>	 (CR) JMeybohm: [C:+1] Update various kafka-main connection strings for kafka-main1007 [deployment-charts] - https://gerrit.wikimedia.org/r/1098559 (https://phabricator.wikimedia.org/T363214) (owner: Effie Mouzeli)
[16:47:18] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7010.magru.wmnet with reason: host reimage
[16:48:43] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T370903)', diff saved to https://phabricator.wikimedia.org/P71222 and previous config saved to /var/cache/conftool/dbconfig/20241127-164843-ladsgroup.json
[16:48:48] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[16:49:43] <wikibugs>	 (PS1) Máté Szabó: Allow IRS to record server-side interaction events [mediawiki-config] - https://gerrit.wikimedia.org/r/1098561 (https://phabricator.wikimedia.org/T380599)
[16:51:05] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7010.magru.wmnet with reason: host reimage
[16:51:11] <wikibugs>	 SRE, SRE-tools, Data-Platform-SRE (2024.11.09 - 2024.11.29), Discovery-Search (Current work): Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507#10362546 (bking) Open→Resolved a:bking
[16:52:05] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[16:52:33] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[16:52:34] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[16:52:43] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[16:52:45] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[16:53:19] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[16:53:21] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[16:53:35] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[16:53:37] <logmsgbot>	 !log jiji@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[16:54:15] <logmsgbot>	 !log jiji@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[16:54:16] <logmsgbot>	 !log jiji@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[16:54:33] <logmsgbot>	 !log jiji@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[16:54:35] <logmsgbot>	 !log jiji@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[16:54:48] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1098558 (https://phabricator.wikimedia.org/T379343) (owner: Muehlenhoff)
[16:54:50] <logmsgbot>	 !log jiji@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[16:54:51] <logmsgbot>	 !log jiji@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[16:55:26] <logmsgbot>	 !log jiji@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[16:55:28] <logmsgbot>	 !log jiji@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[16:56:06] <logmsgbot>	 !log jiji@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[17:03:22] <wikibugs>	 (Abandoned) Raymond Ndibe: profile::manifests::toolforge::bastion: harbor to /etc/toolforge/common.yaml [puppet] - https://gerrit.wikimedia.org/r/1090520 (https://phabricator.wikimedia.org/T358225) (owner: Raymond Ndibe)
[17:03:50] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P71224 and previous config saved to /var/cache/conftool/dbconfig/20241127-170350-ladsgroup.json
[17:06:01] <wikibugs>	 (PS1) Majavah: P:toolforge: checker: Update Redis address [puppet] - https://gerrit.wikimedia.org/r/1098564
[17:06:01] <wikibugs>	 (PS1) Majavah: P:toolforge: Update ToolsDB address [puppet] - https://gerrit.wikimedia.org/r/1098565
[17:07:34] <wikibugs>	 (PS1) C. Scott Ananian: Revert "Normalize ref html before comparison" [extensions/Cite] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098567
[17:07:56] <wikibugs>	 (PS2) C. Scott Ananian: Revert "Normalize ref html before comparison" [extensions/Cite] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098567 (https://phabricator.wikimedia.org/T380977)
[17:08:47] <wikibugs>	 (CR) David Caro: [C:+1] "LGTM, there's also some stuff in the secrets repo" [puppet] - https://gerrit.wikimedia.org/r/1098095 (https://phabricator.wikimedia.org/T380893) (owner: Andrew Bogott)
[17:08:55] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/Cite] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098567 (https://phabricator.wikimedia.org/T380977) (owner: C. Scott Ananian)
[17:09:06] <wikibugs>	 (CR) Majavah: [C:+2] P:toolforge: checker: Update Redis address [puppet] - https://gerrit.wikimedia.org/r/1098564 (owner: Majavah)
[17:09:12] <wikibugs>	 (CR) Jdlrobson: [C:-1] Reenable non-UI experiment quick survey (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1091749 (https://phabricator.wikimedia.org/T379241) (owner: Bernard Wang)
[17:09:12] <wikibugs>	 (CR) Majavah: [C:+2] P:toolforge: Update ToolsDB address [puppet] - https://gerrit.wikimedia.org/r/1098565 (owner: Majavah)
[17:14:28] <wikibugs>	 (PS1) Majavah: hieradata: Drop eqiad.wmflabs from DNS search domains [puppet] - https://gerrit.wikimedia.org/r/1098571 (https://phabricator.wikimedia.org/T305834)
[17:14:44] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.remove-downtime for kafka-main1007.eqiad.wmnet
[17:14:45] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main1007.eqiad.wmnet
[17:15:11] <wikibugs>	 (CR) Effie Mouzeli: [C:+2] Update various kafka-main connection strings for kafka-main1007 [deployment-charts] - https://gerrit.wikimedia.org/r/1098559 (https://phabricator.wikimedia.org/T363214) (owner: Effie Mouzeli)
[17:16:21] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply
[17:16:24] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/benthos-cache-invalidator: apply
[17:16:44] <wikibugs>	 (Merged) jenkins-bot: Update various kafka-main connection strings for kafka-main1007 [deployment-charts] - https://gerrit.wikimedia.org/r/1098559 (https://phabricator.wikimedia.org/T363214) (owner: Effie Mouzeli)
[17:17:03] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7010.magru.wmnet with OS bullseye
[17:17:07] <wikibugs>	 ops-magru, DC-Ops, Infrastructure-Foundations, Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10362709 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp7010.magru.wmnet with OS bulls...
[17:17:35] <wikibugs>	 (CR) Alexandros Kosiaris: [C:-1] "This is a heavy change to the chart, to the point I am wondering whether all these if clauses will pose a burden to us in the future. Anyw" [deployment-charts] - https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: Clément Goubert)
[17:18:57] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P71225 and previous config saved to /var/cache/conftool/dbconfig/20241127-171857-ladsgroup.json
[17:19:32] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply
[17:20:15] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/benthos-cache-invalidator: apply
[17:20:48] <wikibugs>	 (CR) CI reject: [V:-1] Revert "Normalize ref html before comparison" [extensions/Cite] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098567 (https://phabricator.wikimedia.org/T380977) (owner: C. Scott Ananian)
[17:22:03] <wikibugs>	 (PS1) C. Scott Ananian: Turn on Parsoid Read views on jawikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/1098572 (https://phabricator.wikimedia.org/T380769)
[17:22:36] <wikibugs>	 (PS1) Fabfur: Revert "magru: set check_min_fe_mem false" [puppet] - https://gerrit.wikimedia.org/r/1098573
[17:22:42] <wikibugs>	 (CR) Andrew Bogott: [C:+2] hieradata: Drop eqiad.wmflabs from DNS search domains [puppet] - https://gerrit.wikimedia.org/r/1098571 (https://phabricator.wikimedia.org/T305834) (owner: Majavah)
[17:23:51] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
[17:23:54] <wikibugs>	 (CR) Ssingh: [C:+1] Revert "magru: set check_min_fe_mem false" [puppet] - https://gerrit.wikimedia.org/r/1098573 (owner: Fabfur)
[17:24:34] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
[17:24:36] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[17:25:23] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[17:27:16] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply
[17:27:55] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[17:27:56] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[17:28:27] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[17:31:20] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply
[17:31:23] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply
[17:31:54] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
[17:31:55] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply
[17:32:06] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply
[17:32:08] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply
[17:32:21] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[17:32:28] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply
[17:32:29] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[17:32:37] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply
[17:34:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T370903)', diff saved to https://phabricator.wikimedia.org/P71226 and previous config saved to /var/cache/conftool/dbconfig/20241127-173403-ladsgroup.json
[17:34:06] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1203.eqiad.wmnet with reason: Maintenance
[17:34:10] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[17:34:20] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1203.eqiad.wmnet with reason: Maintenance
[17:34:27] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T370903)', diff saved to https://phabricator.wikimedia.org/P71227 and previous config saved to /var/cache/conftool/dbconfig/20241127-173426-ladsgroup.json
[17:35:36] <wikibugs>	 (CR) Elukey: [C:+2] admin: add sspalding to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/1098545 (https://phabricator.wikimedia.org/T380820) (owner: Elukey)
[17:39:04] <wikibugs>	 (PS1) Chlod Alejandro: Increase Nuke max age to 90 days [mediawiki-config] - https://gerrit.wikimedia.org/r/1098574 (https://phabricator.wikimedia.org/T380846)
[17:40:15] <icinga-wm>	 PROBLEM - Host lvs7003 is DOWN: PING CRITICAL - Packet loss = 100%
[17:41:33] <icinga-wm>	 RECOVERY - Host lvs7003 is UP: PING OK - Packet loss = 0%, RTA = 115.12 ms
[17:42:43] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs7003 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[17:44:16] <icinga-wm>	 PROBLEM - pybal on lvs7003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[17:44:43] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs7003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[17:44:49] <icinga-wm>	 PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:45:16] <icinga-wm>	 RECOVERY - pybal on lvs7003 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[17:45:40] <wikibugs>	 (CR) Fabfur: [C:+2] Revert "magru: set check_min_fe_mem false" [puppet] - https://gerrit.wikimedia.org/r/1098573 (owner: Fabfur)
[17:45:56] <wikibugs>	 (CR) Clément Goubert: "I've answered a few of your questions, some are already sort of addressed in PS8. I'll add more comments tomorrow." [deployment-charts] - https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: Clément Goubert)
[17:49:12] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T370903)', diff saved to https://phabricator.wikimedia.org/P71228 and previous config saved to /var/cache/conftool/dbconfig/20241127-174911-ladsgroup.json
[17:49:17] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[17:52:03] <wikibugs>	 ops-magru, DC-Ops, Infrastructure-Foundations, Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10362844 (Fabfur) lvs7003 has been restarted after cable swap, all fine
[17:52:09] <wikibugs>	 ops-magru, DC-Ops, Infrastructure-Foundations, Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10362843 (Fabfur) Reverted https://gerrit.wikimedia.org/r/c/operations/puppet/+/1098573 and ran puppet agent on `A:cp-magru`: NOOP as ex...
[17:54:49] <icinga-wm>	 RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T1800)
[18:02:06] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:03:55] <jinxer-wm>	 FIRING: MaxConntrack: Max conntrack at 91.99% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[18:04:19] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P71229 and previous config saved to /var/cache/conftool/dbconfig/20241127-180418-ladsgroup.json
[18:04:25] <icinga-wm>	 PROBLEM - Check size of conntrack table on krb1001 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[18:05:25] <icinga-wm>	 RECOVERY - Check size of conntrack table on krb1001 is OK: OK: nf_conntrack is 40 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[18:05:40] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wdqs1027.eqiad.wmnet with OS bullseye
[18:05:53] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10362870 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wdqs1027.eqiad.wmnet with OS bullseye
[18:06:55] <wikibugs>	 ops-magru, DC-Ops, Infrastructure-Foundations, Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10362873 (Fabfur) BGP flag enabled on NetBox for lvs700[1-3] and dns700[12]
[18:08:55] <jinxer-wm>	 RESOLVED: MaxConntrack: Max conntrack at 93.44% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[18:09:26] <jinxer-wm>	 FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:19:26] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P71230 and previous config saved to /var/cache/conftool/dbconfig/20241127-181925-ladsgroup.json
[18:20:10] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1098572 (https://phabricator.wikimedia.org/T380769) (owner: C. Scott Ananian)
[18:30:08] <wikibugs>	 (CR) Arlolra: [C:+1] Turn on Parsoid Read views on jawikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/1098572 (https://phabricator.wikimedia.org/T380769) (owner: C. Scott Ananian)
[18:31:02] <wikibugs>	 (CR) Arlolra: [C:+1] "recheck" [extensions/Cite] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098567 (https://phabricator.wikimedia.org/T380977) (owner: C. Scott Ananian)
[18:34:33] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T370903)', diff saved to https://phabricator.wikimedia.org/P71231 and previous config saved to /var/cache/conftool/dbconfig/20241127-183432-ladsgroup.json
[18:34:35] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1209.eqiad.wmnet with reason: Maintenance
[18:34:38] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[18:34:49] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1209.eqiad.wmnet with reason: Maintenance
[18:34:56] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1209 (T370903)', diff saved to https://phabricator.wikimedia.org/P71232 and previous config saved to /var/cache/conftool/dbconfig/20241127-183455-ladsgroup.json
[18:36:47] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[18:36:58] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye
[18:37:08] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10363001 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs1025.eqiad.wmnet with OS bullseye
[18:37:20] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on dns7001.wikimedia.org with reason: T380307
[18:37:22] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dns7001.wikimedia.org with reason: T380307
[18:37:24] <stashbot>	 T380307: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307
[18:37:57] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.remove-downtime for dns7001.wikimedia.org
[18:37:57] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7001.wikimedia.org
[18:38:02] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org
[18:38:02] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org
[18:38:20] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs7001.magru.wmnet
[18:38:20] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs7001.magru.wmnet
[18:38:24] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs7002.magru.wmnet
[18:38:24] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs7002.magru.wmnet
[18:38:28] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs7003.magru.wmnet
[18:38:29] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs7003.magru.wmnet
[18:38:37] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.remove-downtime for 16 hosts
[18:38:44] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 16 hosts
[18:40:06] <wikibugs>	 ops-magru, DC-Ops, Infrastructure-Foundations, Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10363009 (Fabfur) Removed downtime from all lvs, dns and cp hosts in magru
[18:46:43] <wikibugs>	 (PS1) Arlolra: Bump wikimedia/parsoid to 0.21.0-a9 [vendor] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098581 (https://phabricator.wikimedia.org/T373035)
[18:47:04] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: cluster=dnsbox,dc=magru
[18:49:41] <wikibugs>	 ops-magru, DC-Ops, Infrastructure-Foundations, Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10363039 (Fabfur) Repooled dnsbox cluster and run authdns-update
[18:49:47] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T370903)', diff saved to https://phabricator.wikimedia.org/P71233 and previous config saved to /var/cache/conftool/dbconfig/20241127-184946-ladsgroup.json
[18:50:01] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[18:50:39] <wikibugs>	 (CR) C. Scott Ananian: [C:+1] Bump wikimedia/parsoid to 0.21.0-a9 [vendor] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098581 (https://phabricator.wikimedia.org/T373035) (owner: Arlolra)
[18:52:05] <wikibugs>	 (PS1) Arlolra: Bump wikimedia/parsoid to 0.21.0-a9 [core] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098583 (https://phabricator.wikimedia.org/T380664)
[18:53:09] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098583 (https://phabricator.wikimedia.org/T380664) (owner: Arlolra)
[18:54:53] <wikibugs>	 SRE, Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#10363056 (cmooney) So I've been meaning to look at this for ages and while how to generate the records were clear to me, how to update the existing [[ https://gerrit.wikimedia.org/...
[18:56:23] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1025.eqiad.wmnet with OS bullseye
[18:56:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10363080 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs1025.eqiad.wmnet with OS bullseye executed with errors:...
[19:00:57] <wikibugs>	 SRE, Ceph, Infrastructure-Foundations, netops, Patch-For-Review: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501#10363115 (dcaro) A quick search did not find any reference for the mon option on the upstream ceph, but found a commit on a clone:  http://w...
[19:02:20] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye
[19:02:33] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10363116 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs1025.eqiad.wmnet with OS bullseye
[19:04:53] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P71235 and previous config saved to /var/cache/conftool/dbconfig/20241127-190453-ladsgroup.json
[19:05:37] <mutante>	 jouncebot: nowandnext
[19:05:37] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 54 minute(s)
[19:05:38] <jouncebot>	 In 1 hour(s) and 54 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T2100)
[19:06:07] <mutante>	 I am going to deploy a change to puppet code that installs scap. Disabling puppet on "R:scap::target" for a few minutes.
[19:06:32] <mutante>	 but it's expected to be all noop on any existing scap::target
[19:06:55] <mutante>	 it's about fixing an issue on new hosts that get scap installed the first time
[19:09:11] <logmsgbot>	 !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: dc=magru,service=cdn
[19:09:28] <wikibugs>	 ops-magru, DC-Ops, Infrastructure-Foundations, Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10363181 (Fabfur) ran puppet-agent on `A:magru`
[19:10:04] <wikibugs>	 (CR) Dzahn: [C:+2] scap target: ensure scap is installed on host before it is required [puppet] - https://gerrit.wikimedia.org/r/1092841 (https://phabricator.wikimedia.org/T378769) (owner: Jaime Nuche)
[19:12:15] <wikibugs>	 ops-magru, DC-Ops, Infrastructure-Foundations, Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10363189 (Fabfur) Repooled all depooled cp hosts before repooling whole DC
[19:13:04] <mutante>	 !log disabled puppet on R:scap::target (180 hosts) for a short time - deploying gerrit:1092841
[19:13:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:19] <icinga-wm>	 PROBLEM - Disk space on thanos-be2004 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdf1 180187 MB (4% inode=92%): /srv/swift-storage/sdg1 205941 MB (5% inode=92%): /srv/swift-storage/sdc1 151439 MB (3% inode=90%): /srv/swift-storage/sdh1 167101 MB (4% inode=91%): /srv/swift-storage/sde1 178339 MB (4% inode=92%): /srv/swift-storage/sdd1 154510 MB (4% inode=91%): /srv/swift-storage/sdj1 175371 MB (4% inode=92%): /srv/swift-st
[19:14:19] <icinga-wm>	 k1 169838 MB (4% inode=92%): /srv/swift-storage/sdi1 169300 MB (4% inode=92%): /srv/swift-storage/sdl1 199607 MB (5% inode=92%): /srv/swift-storage/sdn1 186908 MB (4% inode=92%): /srv/swift-storage/sdm1 188628 MB (4% inode=92%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2004&var-datasource=codfw+prometheus/ops
[19:14:39] <wikibugs>	 (PS1) Bartosz Dziewoński: Temporarily restore renamed messages [extensions/DiscussionTools] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098590 (https://phabricator.wikimedia.org/T372175)
[19:14:49] <icinga-wm>	 PROBLEM - BGP status on pfw1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:14:50] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/DiscussionTools] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098590 (https://phabricator.wikimedia.org/T372175) (owner: Bartosz Dziewoński)
[19:14:55] <logmsgbot>	 !log mforns@deploy2002 Started deploy [airflow-dags/analytics@99032bf]: regular weekly train
[19:15:34] <mutante>	 interesting, jouncebot claimed nothing is going to be deployed but also it's a regular weekly train
[19:16:27] <rzl>	 mutante: analytics train, not mw train :)
[19:16:49] <icinga-wm>	 RECOVERY - BGP status on pfw1-codfw is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:17:01] <mutante>	 rzl: well, what I am really wondering is just if that uses scap
[19:17:51] <logmsgbot>	 !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@99032bf]: regular weekly train (duration: 03m 10s)
[19:18:03] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.dns.admin DNS admin: pool site magru [reason: repool magru, T376737]
[19:18:10] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site magru [reason: repool magru, T376737]
[19:18:14] <mutante>	 it's ok, if anything this is about seeing puppet errors, not scap deployment errors
[19:18:15] <rzl>	 mutante: for airflow it looks like yes
[19:18:30] <rzl>	 just reading https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Instances#analytics
[19:18:34] <mutante>	 but technically if I ask jouncebot I expected that to mean all
[19:18:37] <mutante>	 thanks rzl
[19:18:52] <brett>	 rzl, jhathaway: magru has been repooled now
[19:19:08] <rzl>	 brett: ack, thanks!
[19:20:00] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P71236 and previous config saved to /var/cache/conftool/dbconfig/20241127-192000-ladsgroup.json
[19:21:47] <wikibugs>	 Puppet, cloud-services-team, Cloud-VPS, Infrastructure-Foundations: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10363222 (Andrew) This has not recurred. Nevertheless we should figure out what's happening with the ruby functions that don't rai...
[19:21:56] <wikibugs>	 ops-magru, DC-Ops, Infrastructure-Foundations, Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10363224 (Fabfur) Repooled magru DC
[19:22:11] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Spicerack: Tab completion for cookbook names - https://phabricator.wikimedia.org/T367230#10363226 (Volans) @JMeybohm that practically covers the current production use case, but is not future proof as it doesn't cover all the generic cases. Hence why I said I wa...
[19:22:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:22:56] <inflatador>	 I've got a brand-new R450 that gives the message "Unified Server Configurator does not support console redirection" when I try to connect to its console, has anyone seen that before?
[19:23:10] <inflatador>	 oops, meant to post that in dc ops
[19:23:28] <mutante>	 sounds like it did not get the BIOS settings to enable console
[19:23:53] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1025.eqiad.wmnet with OS bullseye
[19:24:03] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10363233 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs1025.eqiad.wmnet with OS bullseye executed with errors:...
[19:24:30] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye
[19:24:34] <mutante>	 I have re-enabled puppet on all the scap::target hosts. So far I see no issues EXCEPT on 2 phab hosts.
[19:24:41] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10363234 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs1025.eqiad.wmnet with OS bullseye
[19:24:45] <mutante>	 but I am watching puppetboard for any others.
[19:25:11] <mutante>	 if there is anything then it's a puppet dependency cycle
[19:25:49] <icinga-wm>	 PROBLEM - BGP status on pfw1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:25:54] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1027.eqiad.wmnet with OS bullseye
[19:26:03] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10363248 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wdqs1027.eqiad.wmnet with OS bullseye executed with errors...
[19:27:49] <icinga-wm>	 RECOVERY - BGP status on pfw1-codfw is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:31:41] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2034.codfw.wmnet with reason: Maintenance
[19:31:55] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2034.codfw.wmnet with reason: Maintenance
[19:32:02] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling es2034 (T376905)', diff saved to https://phabricator.wikimedia.org/P71237 and previous config saved to /var/cache/conftool/dbconfig/20241127-193202-ladsgroup.json
[19:32:36] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host wdqs1025.eqiad.wmnet with OS bullseye
[19:34:29] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye
[19:35:07] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T370903)', diff saved to https://phabricator.wikimedia.org/P71238 and previous config saved to /var/cache/conftool/dbconfig/20241127-193507-ladsgroup.json
[19:35:09] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1211.eqiad.wmnet with reason: Maintenance
[19:35:12] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[19:35:22] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1211.eqiad.wmnet with reason: Maintenance
[19:35:30] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1211 (T370903)', diff saved to https://phabricator.wikimedia.org/P71239 and previous config saved to /var/cache/conftool/dbconfig/20241127-193529-ladsgroup.json
[19:36:24] <moritzm>	 !log imported jenkins 2.479.2 to thirdparty/ci for bullseye-wikimedia
[19:36:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:59] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2034 (T376905)', diff saved to https://phabricator.wikimedia.org/P71240 and previous config saved to /var/cache/conftool/dbconfig/20241127-193858-ladsgroup.json
[19:39:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[19:40:18] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10363420 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs1025.eqiad.wmnet with OS bullseye executed with errors:...
[19:42:29] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10363456 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs1025.eqiad.wmnet with OS bullseye
[19:50:00] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host wdqs1025.eqiad.wmnet with OS bullseye
[19:50:09] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10363552 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs1025.eqiad.wmnet with OS bullseye executed with errors:...
[19:50:30] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1026.eqiad.wmnet with OS bullseye
[19:50:46] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10363555 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs1026.eqiad.wmnet with OS bullseye
[19:51:30] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T370903)', diff saved to https://phabricator.wikimedia.org/P71241 and previous config saved to /var/cache/conftool/dbconfig/20241127-195129-ladsgroup.json
[19:51:34] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[19:54:06] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2034', diff saved to https://phabricator.wikimedia.org/P71242 and previous config saved to /var/cache/conftool/dbconfig/20241127-195406-ladsgroup.json
[20:06:37] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P71243 and previous config saved to /var/cache/conftool/dbconfig/20241127-200636-ladsgroup.json
[20:09:14] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2034', diff saved to https://phabricator.wikimedia.org/P71244 and previous config saved to /var/cache/conftool/dbconfig/20241127-200913-ladsgroup.json
[20:18:05] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1026.eqiad.wmnet with reason: host reimage
[20:20:54] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1026.eqiad.wmnet with reason: host reimage
[20:21:44] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P71245 and previous config saved to /var/cache/conftool/dbconfig/20241127-202143-ladsgroup.json
[20:24:21] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2034 (T376905)', diff saved to https://phabricator.wikimedia.org/P71246 and previous config saved to /var/cache/conftool/dbconfig/20241127-202420-ladsgroup.json
[20:24:26] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2029.codfw.wmnet with reason: Maintenance
[20:24:40] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2029.codfw.wmnet with reason: Maintenance
[20:24:47] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling es2029 (T376905)', diff saved to https://phabricator.wikimedia.org/P71247 and previous config saved to /var/cache/conftool/dbconfig/20241127-202446-ladsgroup.json
[20:31:43] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2029 (T376905)', diff saved to https://phabricator.wikimedia.org/P71248 and previous config saved to /var/cache/conftool/dbconfig/20241127-203143-ladsgroup.json
[20:35:12] <wikibugs>	 (PS1) Andrew Bogott: Openstack nova: make a few more read-only endpoints public [puppet] - https://gerrit.wikimedia.org/r/1098613 (https://phabricator.wikimedia.org/T380069)
[20:36:51] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T370903)', diff saved to https://phabricator.wikimedia.org/P71249 and previous config saved to /var/cache/conftool/dbconfig/20241127-203650-ladsgroup.json
[20:36:52] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1214.eqiad.wmnet with reason: Maintenance
[20:36:55] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[20:37:18] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1214.eqiad.wmnet with reason: Maintenance
[20:37:23] <wikibugs>	 (PS2) Andrew Bogott: Openstack nova: make a few more read-only endpoints public [puppet] - https://gerrit.wikimedia.org/r/1098613 (https://phabricator.wikimedia.org/T380069)
[20:37:25] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T370903)', diff saved to https://phabricator.wikimedia.org/P71250 and previous config saved to /var/cache/conftool/dbconfig/20241127-203724-ladsgroup.json
[20:38:18] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002"
[20:39:06] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Openstack nova: make a few more read-only endpoints public [puppet] - https://gerrit.wikimedia.org/r/1098613 (https://phabricator.wikimedia.org/T380069) (owner: Andrew Bogott)
[20:43:46] <wikibugs>	 (CR) Jforrester: Add CodeMirror to BetaFeaturesAllowList (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1098161 (https://phabricator.wikimedia.org/T376735) (owner: MusikAnimal)
[20:44:50] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2237 depool (T379813)', diff saved to https://phabricator.wikimedia.org/P71251 and previous config saved to /var/cache/conftool/dbconfig/20241127-204450-ladsgroup.json
[20:44:56] <stashbot>	 T379813: Wikimedia\Rdbms\DBQueryError: Error 1034: Index for table 'wbc_entity_usage' is corrupt; try to repair itFunction: Wikibase\Client\Usage\Sql\EntityUsageTable::queryUsagesQuery: SELECT  eu_aspect,eu_entity_id  FROM `wbc_entity - https://phabricator.wikimedia.org/T379813
[20:45:18] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Optimize (T379813)
[20:45:31] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Optimize (T379813)
[20:46:10] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:46:51] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2029', diff saved to https://phabricator.wikimedia.org/P71252 and previous config saved to /var/cache/conftool/dbconfig/20241127-204650-ladsgroup.json
[20:52:38] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T370903)', diff saved to https://phabricator.wikimedia.org/P71253 and previous config saved to /var/cache/conftool/dbconfig/20241127-205238-ladsgroup.json
[20:52:41] <wikibugs>	 (PS1) DDesouza: Reader Survey: Undeploy on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1098617 (https://phabricator.wikimedia.org/T378660)
[20:52:43] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[20:55:36] <wikibugs>	 (PS1) Gergő Tisza: Use `useformat` query param for device detection or mobile domain (m.) [extensions/CentralAuth] (wmf/1.44.0-wmf.4) - https://gerrit.wikimedia.org/r/1098622 (https://phabricator.wikimedia.org/T380646)
[20:56:05] <wikibugs>	 (PS1) Gergő Tisza: Use `useformat` query param for device detection or mobile domain (m.) [extensions/CentralAuth] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098623 (https://phabricator.wikimedia.org/T380646)
[20:56:43] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/CentralAuth] (wmf/1.44.0-wmf.4) - https://gerrit.wikimedia.org/r/1098622 (https://phabricator.wikimedia.org/T380646) (owner: Gergő Tisza)
[20:56:57] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/CentralAuth] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098623 (https://phabricator.wikimedia.org/T380646) (owner: Gergő Tisza)
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T2100).
[21:00:05] <jouncebot>	 arlolra, MatmaRex, and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:28] <arlolra>	 o/
[21:00:57] <MatmaRex>	 hi
[21:01:04] <tgr|away>	 o/
[21:01:58] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2029', diff saved to https://phabricator.wikimedia.org/P71254 and previous config saved to /var/cache/conftool/dbconfig/20241127-210157-ladsgroup.json
[21:04:55] <jinxer-wm>	 FIRING: MaxConntrack: Max conntrack at 93.06% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[21:05:04] <cjming>	 is a deployer needed?
[21:05:54] <cjming>	 or are folks in the queue able to self-deploy?
[21:06:59] <tgr|away>	 cjming: I can self-deploy (also deploy the rest if there's no one else but happy to leave that to you if you are willing)
[21:07:46] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P71255 and previous config saved to /var/cache/conftool/dbconfig/20241127-210745-ladsgroup.json
[21:07:55] <wikibugs>	 (PS1) Dzahn: Revert "scap target: ensure scap is installed on host before it is required" [puppet] - https://gerrit.wikimedia.org/r/1098625
[21:08:26] <cjming>	 arlolra: do you need a deployer?
[21:08:33] <cjming>	 MatmaRex: same Q to you?
[21:08:44] <MatmaRex>	 yes please
[21:08:49] <arlolra>	 cjming: I don't have much experience deploying mediawiki but I'm in the deployment group
[21:09:22] <cjming>	 ok - how about this -- i'll do the ones for arlolra and MatmaRex and then pass to you tgr?
[21:09:46] <cjming>	 arlolra: can your backports go out together?
[21:09:55] <jinxer-wm>	 RESOLVED: MaxConntrack: Max conntrack at 93.06% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[21:10:10] <wikibugs>	 (CR) CI reject: [V:-1] Revert "scap target: ensure scap is installed on host before it is required" [puppet] - https://gerrit.wikimedia.org/r/1098625 (owner: Dzahn)
[21:10:16] <cjming>	 arlolra: is order important?  can i merge your backports and do the config patch first?
[21:10:37] <arlolra>	 The order is important, the config patch should go last
[21:10:54] <arlolra>	 I can try to do mine if everyone has patience
[21:11:29] <arlolra>	 And is around to help pick up the pieces
[21:12:01] <cjming>	 arlolra: can your backports go out together?
[21:12:47] <arlolra>	 The 1.44.0-wmf.5 patches can go out together, yes
[21:13:11] <arlolra>	 But it might be better to do the revert
[21:13:13] <arlolra>	 Then the bumps
[21:13:16] <arlolra>	 Then the config
[21:13:29] <cjming>	 got it - ok - i'll do yours first then
[21:14:03] <arlolra>	 Thanks
[21:14:12] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [extensions/Cite] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098567 (https://phabricator.wikimedia.org/T380977) (owner: C. Scott Ananian)
[21:14:59] <cjming>	 arlolra: because backports take so long to merge - i'm going to go ahead and merge your bump patches now too
[21:15:19] <arlolra>	 Sounds good
[21:15:23] <wikibugs>	 (CR) Clare Ming: [C:+2] Bump wikimedia/parsoid to 0.21.0-a9 [vendor] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098581 (https://phabricator.wikimedia.org/T373035) (owner: Arlolra)
[21:15:31] <wikibugs>	 (CR) Clare Ming: [C:+2] Bump wikimedia/parsoid to 0.21.0-a9 [core] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098583 (https://phabricator.wikimedia.org/T380664) (owner: Arlolra)
[21:15:54] <wikibugs>	 (PS2) Dzahn: Revert "scap target: ensure scap is installed on host before it is required" [puppet] - https://gerrit.wikimedia.org/r/1098625
[21:16:18] <cscott>	 (I'm also here if needed!)
[21:16:41] <wikibugs>	 (PS1) DDesouza: Reader Survey: Deploy on multiple wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1098627 (https://phabricator.wikimedia.org/T378660)
[21:16:44] * cjming thanks cscott
[21:17:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2029 (T376905)', diff saved to https://phabricator.wikimedia.org/P71256 and previous config saved to /var/cache/conftool/dbconfig/20241127-211704-ladsgroup.json
[21:18:10] <wikibugs>	 (CR) CI reject: [V:-1] Revert "scap target: ensure scap is installed on host before it is required" [puppet] - https://gerrit.wikimedia.org/r/1098625 (owner: Dzahn)
[21:18:25] <cjming>	 tgr: i'll pass to you when i finish with arlolra's and MatmaRex's patches -- it will probably be at least 30 minutes from now
[21:18:42] <tgr|away>	 thx
[21:21:02] <cjming>	 MatmaRex: i'll manually merge your patch in about 5-10 minutes so it'll be ready for scap backport after the first bunch finish
[21:21:20] <MatmaRex>	 sure, thanks
[21:21:23] <wikibugs>	 (PS3) Dzahn: Revert "scap target: ensure scap is installed on host before it is required" [puppet] - https://gerrit.wikimedia.org/r/1098625
[21:22:29] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1098617 (https://phabricator.wikimedia.org/T378660) (owner: DDesouza)
[21:22:53] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P71257 and previous config saved to /var/cache/conftool/dbconfig/20241127-212252-ladsgroup.json
[21:24:48] <tgr|away>	 ugh
[21:24:49] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1098627 (https://phabricator.wikimedia.org/T378660) (owner: DDesouza)
[21:24:52] <tgr|away>	 why do we have a login.m.wikimedia.beta.wmflabs.org?
[21:25:31] <cscott>	 I think it needs some more domain components 
[21:25:54] <tgr|away>	 it shouldn't exist at all
[21:26:06] <tgr|away>	 login.m.wikimedia.org is a DNS lookup error
[21:26:34] <tgr|away>	 as in, there is intentionally no mobile variant
[21:26:49] <tgr|away>	 but apparently all the special-casing around that breaks on beta
[21:27:45] <mutante>	 can't have wildcard SSL certs for more than one level, afair
[21:28:36] <tgr|away>	 that URL structure is fine in general
[21:28:49] <tgr|away>	 the mobile version of en.wikipedia.org is en.m.wikipedia.org etc
[21:29:11] <tgr|away>	 en.wikipedia.beta.wmflabs.org / en.m.wikipedia.beta.wmflabs.org on beta
[21:29:20] <wikibugs>	 (CR) Clare Ming: [C:+2] Temporarily restore renamed messages [extensions/DiscussionTools] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098590 (https://phabricator.wikimedia.org/T372175) (owner: Bartosz Dziewoński)
[21:29:48] <tgr|away>	 but we don't want a separate mobile domain for loginwiki since the entire point is having a single domain for central session cookies so we don't want to split that by device
[21:30:15] <tgr|away>	 but somehow on beta both MediaWiki and the DNS / routing infra think there is a separate mobile login domain
[21:30:37] <wikibugs>	 (Merged) jenkins-bot: Revert "Normalize ref html before comparison" [extensions/Cite] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098567 (https://phabricator.wikimedia.org/T380977) (owner: C. Scott Ananian)
[21:31:00] <wikibugs>	 (Merged) jenkins-bot: Bump wikimedia/parsoid to 0.21.0-a9 [vendor] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098581 (https://phabricator.wikimedia.org/T373035) (owner: Arlolra)
[21:31:05] <logmsgbot>	 !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1098567|Revert "Normalize ref html before comparison" (T380977)]]
[21:31:06] <cjming>	 arlolra: do you need to verify the revert or am i good to sync?
[21:31:10] <stashbot>	 T380977: Wikimedia\RemexHtml\TreeBuilder\TreeBuilderError: Setting foreign attributes is not supported - https://phabricator.wikimedia.org/T380977
[21:31:18] <arlolra>	 I can verify it
[21:31:28] <cjming>	 1 sec then
[21:32:02] <wikibugs>	 (PS1) Andrew Bogott: openstack nova policy: open a few more read-only endpoints [puppet] - https://gerrit.wikimedia.org/r/1098628 (https://phabricator.wikimedia.org/T380069)
[21:32:37] <cscott>	 Arlo there's a url in phab and the slack thread
[21:33:10] <arlolra>	 Yup
[21:33:23] <cscott>	 https://he.wikipedia.org/w/index.php?title=%D7%95%D7%95%D7%90%D7%98%D7%A1%D7%90%D7%A4&uselang=en&useparsoid=1
[21:33:27] <wikibugs>	 (CR) Andrew Bogott: [C:+2] openstack nova policy: open a few more read-only endpoints [puppet] - https://gerrit.wikimedia.org/r/1098628 (https://phabricator.wikimedia.org/T380069) (owner: Andrew Bogott)
[21:34:21] <cscott>	 (but I don't have x-wikimedia-debug on my phone browser so I can't check whether it's fixed)
[21:35:14] <arlolra>	 Surely you can send a header from your phone
[21:35:47] <wikibugs>	 (Merged) jenkins-bot: Bump wikimedia/parsoid to 0.21.0-a9 [core] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098583 (https://phabricator.wikimedia.org/T380664) (owner: Arlolra)
[21:35:48] <cjming>	 arlolra: please check revert - up on test servers
[21:35:56] <cscott>	 Just give me a telnet client to port 80
[21:36:02] <tgr|away>	 cscott: you can set a cookie on mobile
[21:36:27] <arlolra>	 cjming: Thanks, it's working as expected.  Please continue
[21:36:43] <tgr|away>	 just visit Special:WikimediaDebug
[21:36:53] <cjming>	 cool - syncing
[21:37:06] <logmsgbot>	 !log cjming@deploy2002 cjming, cscott: Backport for [[gerrit:1098567|Revert "Normalize ref html before comparison" (T380977)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:37:08] <logmsgbot>	 !log cjming@deploy2002 cjming, cscott: Continuing with sync
[21:37:08] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002"
[21:37:08] <cscott>	 You're going to make me into one of those millennials who does all their hacking from their phone
[21:37:09] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1026.eqiad.wmnet with OS bullseye
[21:37:11] <stashbot>	 T380977: Wikimedia\RemexHtml\TreeBuilder\TreeBuilderError: Setting foreign attributes is not supported - https://phabricator.wikimedia.org/T380977
[21:37:25] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10363891 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs1026.eqiad.wmnet with OS bullseye completed: - wdqs1026...
[21:38:00] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T370903)', diff saved to https://phabricator.wikimedia.org/P71258 and previous config saved to /var/cache/conftool/dbconfig/20241127-213759-ladsgroup.json
[21:38:02] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[21:38:05] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[21:38:16] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[21:39:44] <wikibugs>	 (Merged) jenkins-bot: Temporarily restore renamed messages [extensions/DiscussionTools] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098590 (https://phabricator.wikimedia.org/T372175) (owner: Bartosz Dziewoński)
[21:40:22] <cscott>	 tgr|away: that's a good trick that I didn't know before
[21:40:35] <logmsgbot>	 !log bking@cumin1002 START - Cookbook sre.hosts.reimage for host wdqs1027.eqiad.wmnet with OS bullseye
[21:40:41] <cscott>	 I too can now confirm that the canaries look good ;) a little slow
[21:40:49] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10363905 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1002 for host wdqs1027.eqiad.wmnet with OS bullseye
[21:40:55] <cjming>	 so slow
[21:40:58] <wikibugs>	 (CR) Dzahn: [C:+2] Revert "scap target: ensure scap is installed on host before it is required" [puppet] - https://gerrit.wikimedia.org/r/1098625 (owner: Dzahn)
[21:43:54] <logmsgbot>	 !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098567|Revert "Normalize ref html before comparison" (T380977)]] (duration: 12m 49s)
[21:43:59] <stashbot>	 T380977: Wikimedia\RemexHtml\TreeBuilder\TreeBuilderError: Setting foreign attributes is not supported - https://phabricator.wikimedia.org/T380977
[21:45:06] <logmsgbot>	 !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1098581|Bump wikimedia/parsoid to 0.21.0-a9 (T373035 T380664)]], [[gerrit:1098583|Bump wikimedia/parsoid to 0.21.0-a9 (T380664)]]
[21:45:11] <stashbot>	 T373035: TypeError: Argument 1 passed to Wikimedia\Parsoid\Config\Env::makeTitleFromURLDecodedStr() must be of the type string, int given, called in /vendor/wikimedia/parsoid/src/Wt2Html/DOM/Processors/AddRedLinks.php:90 - https://phabricator.wikimedia.org/T373035
[21:45:11] <stashbot>	 T380664: CTT tasks week of 2024-11-22 - https://phabricator.wikimedia.org/T380664
[21:47:17] <tgr|away>	 apparently production MediaWiki also thinks login.m.wikimedia.org exists
[21:47:24] <tgr|away>	 how does this not break everything?
[21:47:54] <cscott>	 Now that you've observed it, it will undoubtedly start breaking everything.
[21:48:14] <tgr|away>	 cjming: I'll retract the backports, need to fix mobile domain configuration first
[21:48:25] <cjming>	 tgr: sounds good
[21:52:40] <cjming>	 oof - not sure what's happening - presumably scap is still doing its thing - could it be stuck?
[21:53:03] <mutante>	 also see https://phabricator.wikimedia.org/T152882 for "misc wikis lack mobile domains"
[21:53:25] <wikibugs>	 (PS1) Gergő Tisza: Fix mobile domain logic for login.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1098633
[21:53:37] <arlolra>	 :/
[21:53:46] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1226.eqiad.wmnet with reason: Maintenance
[21:53:54] <tgr|away>	 (unless someone is willing to +1 ^^ and then I can deploy it in the window)
[21:54:00] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1226.eqiad.wmnet with reason: Maintenance
[21:54:07] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T370903)', diff saved to https://phabricator.wikimedia.org/P71259 and previous config saved to /var/cache/conftool/dbconfig/20241127-215407-ladsgroup.json
[21:54:12] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[21:55:10] <cjming>	 any SREs around to confirm all is ok?  it seems to be stuck at: https://www.irccloud.com/pastebin/5ZI3MnEf/
[21:55:25] <tgr|away>	 cjming: 10 minutes doesn't seem that extreme
[21:55:32] <cjming>	 really?
[21:56:00] <cjming>	 i've been trying to cultivate more patience
[21:56:08] <tgr|away>	 well if it's stuck at building the images then maybe yes
[21:56:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.294s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:56:21] <tgr|away>	 I have seen the sync part take a while
[21:56:56] <cjming>	 in my experience it's CI that takes the most time -- scap has been pretty zippy lately imho
[21:57:16] <tgr|away>	 maybe that logfile contains something useful?
[21:59:15] <subbu>	 from what i remember hearing, vendor syncs take a while whereas config / single-file syncs are zippy
[21:59:28] <cjming>	 finally started going again -- hopefully all good
[21:59:52] <wikibugs>	 (CR) Bartosz Dziewoński: [C:+1] Fix mobile domain logic for login.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1098633 (owner: Gergő Tisza)
[22:00:04] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T2200)
[22:00:05] <rzl>	 received a VO magru page but I think it's just a 24-hour repage from yesterday
[22:00:11] <rzl>	 (cc jhathaway)
[22:00:26] <rzl>	 yeah it is, marking it resolved
[22:01:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.294s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[22:01:19] <cjming>	 subbu: gtk - thanks
[22:02:06] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:02:30] <jhathaway>	 thanks @rzl
[22:02:49] <jhathaway>	 damn that muscle memory :(
[22:04:30] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 28 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1098633 (owner: Gergő Tisza)
[22:04:31] <rzl>	 jhathaway: paging you at 3:59 when the long weekend starts at 4:00 is admittedly pretty funny :) see you, enjoy
[22:04:55] <jhathaway>	 :), thanks!
[22:06:38] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T370903)', diff saved to https://phabricator.wikimedia.org/P71260 and previous config saved to /var/cache/conftool/dbconfig/20241127-220638-ladsgroup.json
[22:06:44] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[22:06:59] <wikibugs>	 (PS2) Cathal Mooney: Example of QoS rules for cloudcephosd [puppet] - https://gerrit.wikimedia.org/r/1058612 (https://phabricator.wikimedia.org/T371501)
[22:07:38] <wikibugs>	 (CR) CI reject: [V:-1] Example of QoS rules for cloudcephosd [puppet] - https://gerrit.wikimedia.org/r/1058612 (https://phabricator.wikimedia.org/T371501) (owner: Cathal Mooney)
[22:07:46] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:07:59] <logmsgbot>	 !log bking@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1027.eqiad.wmnet with reason: host reimage
[22:08:46] <wikibugs>	 (PS3) Cathal Mooney: Example of QoS rules for cloudcephosd [puppet] - https://gerrit.wikimedia.org/r/1058612 (https://phabricator.wikimedia.org/T371501)
[22:08:48] <cjming>	 arlolra: bump patches should be up on test servers
[22:09:06] <logmsgbot>	 !log cjming@deploy2002 arlolra, cjming: Backport for [[gerrit:1098581|Bump wikimedia/parsoid to 0.21.0-a9 (T373035 T380664)]], [[gerrit:1098583|Bump wikimedia/parsoid to 0.21.0-a9 (T380664)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:09:11] <stashbot>	 T373035: TypeError: Argument 1 passed to Wikimedia\Parsoid\Config\Env::makeTitleFromURLDecodedStr() must be of the type string, int given, called in /vendor/wikimedia/parsoid/src/Wt2Html/DOM/Processors/AddRedLinks.php:90 - https://phabricator.wikimedia.org/T373035
[22:09:12] <stashbot>	 T380664: CTT tasks week of 2024-11-22 - https://phabricator.wikimedia.org/T380664
[22:09:26] <jinxer-wm>	 FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:09:29] <arlolra>	 cjming: Testing
[22:09:38] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:09:41] <MatmaRex>	 (i'm still around, if we reach my backport patch today)
[22:11:03] <logmsgbot>	 !log bking@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1027.eqiad.wmnet with reason: host reimage
[22:11:25] <cjming>	 MatmaRex: your patch might be up on the test servers too i think
[22:11:47] <MatmaRex>	 oh. lemme test
[22:12:07] <tgr|away>	 I wonder if it would make sense to deploy code in batches? it's basically impossible to fit 5-6 patches in a backport window since we switched to k8s / full scap
[22:12:21] <cjming>	 ^^^ agree
[22:12:22] <MatmaRex>	 cjming: yep, my thing looks fixed on mwdebug
[22:13:25] <cjming>	 cool - i think i still need to scap backport yours separately? anyway, i can do that after current batch is done
[22:13:40] <arlolra>	 cjming: please continue
[22:13:44] <cjming>	 phew!
[22:13:46] <logmsgbot>	 !log cjming@deploy2002 arlolra, cjming: Continuing with sync
[22:13:58] <tgr|away>	 if it's on the testservers, it will be in production as well
[22:14:25] <tgr|away>	 if scap complained about an unexpected patch and showed you the diff, it's going to be deployed
[22:14:47] <cjming>	 got it - so no need to scap backport manually merged patches?
[22:16:07] <tgr|away>	 scap backport just merges the patch and then does a git pull and a full scap (and rebase security patches and other fine details like that)
[22:16:28] <tgr|away>	 so a merge and then it syncs out the git heads to the servers, basically
[22:20:17] <cjming>	 when there are several backports in a queue, i end up manually merging stuff to get ahead of CI but i get concerned if anything ever needs reverting 
[22:21:11] <cjming>	 MatmaRex: then yours should be live along with arlolra's backports - hopefully soonish
[22:21:45] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P71261 and previous config saved to /var/cache/conftool/dbconfig/20241127-222145-ladsgroup.json
[22:22:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:22:28] <wikibugs>	 (CR) Gergő Tisza: "@nshahquinn@wikimedia.org FYI (since the comment says to notify you)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1098633 (owner: Gergő Tisza)
[22:26:20] <logmsgbot>	 !log bking@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin1002"
[22:27:44] <logmsgbot>	 !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098581|Bump wikimedia/parsoid to 0.21.0-a9 (T373035 T380664)]], [[gerrit:1098583|Bump wikimedia/parsoid to 0.21.0-a9 (T380664)]] (duration: 42m 38s)
[22:27:49] <stashbot>	 T373035: TypeError: Argument 1 passed to Wikimedia\Parsoid\Config\Env::makeTitleFromURLDecodedStr() must be of the type string, int given, called in /vendor/wikimedia/parsoid/src/Wt2Html/DOM/Processors/AddRedLinks.php:90 - https://phabricator.wikimedia.org/T373035
[22:27:50] <stashbot>	 T380664: CTT tasks week of 2024-11-22 - https://phabricator.wikimedia.org/T380664
[22:27:51] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1098572 (https://phabricator.wikimedia.org/T380769) (owner: C. Scott Ananian)
[22:28:10] <cjming>	 arlolra: revert + bumps should be live - doing your config patch no
[22:28:14] <cjming>	 *now
[22:28:29] <cjming>	 MatmaRex: yours should be live too
[22:28:31] <cscott>	 Yay
[22:28:36] <wikibugs>	 (Merged) jenkins-bot: Turn on Parsoid Read views on jawikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/1098572 (https://phabricator.wikimedia.org/T380769) (owner: C. Scott Ananian)
[22:28:38] <MatmaRex>	 yep, i see it. thanks cjming
[22:28:45] <arlolra>	 cjming: thanks
[22:29:03] <logmsgbot>	 !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1098572|Turn on Parsoid Read views on jawikivoyage (T380769)]]
[22:29:08] <stashbot>	 T380769: Deploy Parsoid Read Views to ja wikivoyage (week of 2024-11-27) - https://phabricator.wikimedia.org/T380769
[22:29:21] <cscott>	 If this all goes off smoothly, we should have a party in two weeks to celebrate.  Maybe wmf will fly us all to Barcelona for it.
[22:29:37] <cjming>	 lol
[22:30:28] <subbu>	 arlolra has to step out .. so I am around to verify the last bit ... parsoid read view on jawikivoyage.
[22:30:46] <cjming>	 cool - should be ready shortly
[22:30:49] <cscott>	 I can also verify from my phone now, thanks to tgr
[22:30:52] <subbu>	 but looks like cscott also has power to be a cool mobile-testing kid.
[22:31:06] <subbu>	 and he can now read my mind too it seems with his newfound powers.
[22:31:24] <cscott>	 I'm mostly here as a distraction apparently
[22:31:26] <arlolra>	 Sorry, yes, I have to make it to the pharmacy before it closes but subbu and cscott are here
[22:31:31] <cjming>	 np!
[22:32:07] <arlolra>	 cjming: thanks for the help and my resolution will be doing my own deploys in the new year
[22:32:48] <cjming>	 arlolra: yw! good resolution :)
[22:34:11] <cscott>	 I'll queue up ja.wikivoyage on my phone while I'm waiting
[22:34:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on wikikube-worker1256:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:35:04] <cjming>	 cscott, subbu: config patch should be up on test servers
[22:35:04] <subbu>	 cjming, looks like this is already on the test servers. I can see that it rolled out there
[22:35:12] <cjming>	 yay!
[22:35:20] <cjming>	 so gtg?
[22:35:21] <subbu>	 ok, so yes verified that it is rendering properly there with parsoid.
[22:35:25] <subbu>	 yes.
[22:35:30] <cjming>	 nice
[22:35:40] <logmsgbot>	 !log cjming@deploy2002 cscott, cjming: Backport for [[gerrit:1098572|Turn on Parsoid Read views on jawikivoyage (T380769)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:35:44] <logmsgbot>	 !log cjming@deploy2002 cscott, cjming: Continuing with sync
[22:35:45] <stashbot>	 T380769: Deploy Parsoid Read Views to ja wikivoyage (week of 2024-11-27) - https://phabricator.wikimedia.org/T380769
[22:36:40] <cscott>	 Hey I can confirm it works 
[22:36:47] <cjming>	 woohoo!
[22:36:47] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[22:36:53] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P71262 and previous config saved to /var/cache/conftool/dbconfig/20241127-223652-ladsgroup.json
[22:36:53] <subbu>	 thanks!
[22:39:26] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on wikikube-worker1256:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:42:40] <cscott>	 I updated the village pump notices to let jawikivoyage know that we were able to squeeze in their deploy this week.
[22:43:38] <wikibugs>	 (PS2) Gergő Tisza: Fix mobile domain logic for login.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1098633 (https://phabricator.wikimedia.org/T380646)
[22:44:26] <logmsgbot>	 !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098572|Turn on Parsoid Read views on jawikivoyage (T380769)]] (duration: 15m 22s)
[22:44:30] <cjming>	 subbu, cscott: should be live!
[22:44:30] <stashbot>	 T380769: Deploy Parsoid Read Views to ja wikivoyage (week of 2024-11-27) - https://phabricator.wikimedia.org/T380769
[22:45:36] <subbu>	 \o/
[22:46:50] <cjming>	 !log end of UTC late backport window
[22:46:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:47:17] <tgr|away>	 I'll deploy a config fix
[22:47:51] <cjming>	 oh - sorry - prematurely closed the window
[22:48:48] <cscott>	 Tgr needs to defenestrate something still.
[22:49:17] <subbu>	 cscott, i am signing off ... back online in 3 hours if there is anything needed.
[22:49:34] <cscott>	 👍
[22:49:35] <subbu>	 but available on signal.
[22:49:45] <cscott>	 Thanks again cjming 
[22:49:58] <cjming>	 ur welcome :)
[22:50:04] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1098633 (https://phabricator.wikimedia.org/T380646) (owner: Gergő Tisza)
[22:50:47] <wikibugs>	 (Merged) jenkins-bot: Fix mobile domain logic for login.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1098633 (https://phabricator.wikimedia.org/T380646) (owner: Gergő Tisza)
[22:51:16] <logmsgbot>	 !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1098633|Fix mobile domain logic for login.wikimedia.org (T380646)]]
[22:51:21] <stashbot>	 T380646: Centralize SUL2 and SUL3 device detection - https://phabricator.wikimedia.org/T380646
[22:51:59] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T370903)', diff saved to https://phabricator.wikimedia.org/P71263 and previous config saved to /var/cache/conftool/dbconfig/20241127-225159-ladsgroup.json
[22:52:01] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[22:52:04] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[22:52:15] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[22:56:47] <logmsgbot>	 !log tgr@deploy2002 tgr: Backport for [[gerrit:1098633|Fix mobile domain logic for login.wikimedia.org (T380646)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:56:51] <stashbot>	 T380646: Centralize SUL2 and SUL3 device detection - https://phabricator.wikimedia.org/T380646
[23:01:39] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2152.codfw.wmnet with reason: Maintenance
[23:01:52] <wikibugs>	 (PS2) DDesouza: Reader Survey: Deploy on multiple wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1098627 (https://phabricator.wikimedia.org/T378660)
[23:01:52] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2152.codfw.wmnet with reason: Maintenance
[23:02:00] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T370903)', diff saved to https://phabricator.wikimedia.org/P71264 and previous config saved to /var/cache/conftool/dbconfig/20241127-230159-ladsgroup.json
[23:02:04] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[23:02:27] <logmsgbot>	 !log tgr@deploy2002 tgr: Continuing with sync
[23:09:24] <logmsgbot>	 !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098633|Fix mobile domain logic for login.wikimedia.org (T380646)]] (duration: 18m 07s)
[23:09:26] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:09:28] <stashbot>	 T380646: Centralize SUL2 and SUL3 device detection - https://phabricator.wikimedia.org/T380646
[23:15:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T370903)', diff saved to https://phabricator.wikimedia.org/P71267 and previous config saved to /var/cache/conftool/dbconfig/20241127-231504-ladsgroup.json
[23:15:09] <stashbot>	 T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903
[23:30:11] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P71269 and previous config saved to /var/cache/conftool/dbconfig/20241127-233011-ladsgroup.json
[23:39:39] <jinxer-wm>	 FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[23:40:42] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:45:18] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P71270 and previous config saved to /var/cache/conftool/dbconfig/20241127-234518-ladsgroup.json
[23:51:44] <wikibugs>	 (PS3) Tim Starling: Move default main page text for new wikis to config [mediawiki-config] - https://gerrit.wikimedia.org/r/1094126 (https://phabricator.wikimedia.org/T352113)
[23:53:07] <wikibugs>	 (CR) Tim Starling: Move default main page text for new wikis to config (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1094126 (https://phabricator.wikimedia.org/T352113) (owner: Tim Starling)