[00:03:10] hi, just wanting to attract attention to T380729 [00:03:11] T380729: 2024-11-20 dump run appears stuck - https://phabricator.wikimedia.org/T380729 [00:03:12] in a few days it will be time for the 2024-12-01 dump to begin, so it would be good if the relevant people could investigate this sooner rather than later [00:03:15] thanks! [00:28:03] Reedy: hah! nice find [00:38:17] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1098221 [00:38:17] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1098221 (owner: 10TrainBranchBot) [01:01:07] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1098221 (owner: 10TrainBranchBot) [01:08:25] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1098231 [01:08:25] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1098231 (owner: 10TrainBranchBot) [01:30:07] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1098231 (owner: 10TrainBranchBot) [01:39:23] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10360298 (10Platonides) It does look good so far. While it was losing messages at *:40-*:45, in the last two days it has only lost at 11:45-11... 
[01:57:31] FIRING: Not accepting/receiving prefixes from anycast BGP peer: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [02:09:26] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:36:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [03:04:11] FIRING: [13x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:07:06] FIRING: [13x] ProbeDown: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:00:28] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 626.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:52:28] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:26:36] PROBLEM - snapshot of x1 in codfw on backupmon1001 is CRITICAL: Last snapshot for x1 at codfw (db2197) taken on 2024-11-27 04:55:03 is 360 GiB, but the previous one was 531 GiB, a change of -32.2 % 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [05:50:02] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:57:31] FIRING: Not accepting/receiving prefixes from anycast BGP peer: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [05:58:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098076 (https://phabricator.wikimedia.org/T363484) (owner: 10C. Scott Ananian) [06:02:35] (03PS2) 10C. Scott Ananian: Deploy Parsoid Read Views to de/ru wikivoyage and dagwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093405 (https://phabricator.wikimedia.org/T375394) [06:03:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093405 (https://phabricator.wikimedia.org/T375394) (owner: 10C. 
Scott Ananian) [06:09:26] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:11:48] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:18:02] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Idle - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:36:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T0700) [07:00:53] (03PS1) 10Abijeet Patro: ext.uls.inputsettings: Use arrow functions [extensions/UniversalLanguageSelector] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1098413 (https://phabricator.wikimedia.org/T380431) [07:00:56] (03PS1) 10Kevin Bazira: ml-services: deploy article-country to the article-models ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098414 (https://phabricator.wikimedia.org/T371897) [07:01:09] (03PS1) 10Abijeet Patro: Fix illegal access of typed property. 
[extensions/TranslationNotifications] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1098415 (https://phabricator.wikimedia.org/T380724) [07:03:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/TranslationNotifications] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1098415 (https://phabricator.wikimedia.org/T380724) (owner: 10Abijeet Patro) [07:04:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 27 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/UniversalLanguageSelector] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1098413 (https://phabricator.wikimedia.org/T380431) (owner: 10Abijeet Patro) [07:07:06] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:17:56] FIRING: ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:22:56] RESOLVED: ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:23:36] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[07:24:19] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [07:28:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:33:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:34:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti7002.magru.wmnet with OS bookworm [07:39:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:52:22] (03CR) 10Jelto: [C:03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/1098091 (owner: 10JMeybohm) [07:57:29] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7002.magru.wmnet with reason: host reimage [08:00:05] Amir1, Urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T0800). Please do the needful. [08:00:05] abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[08:00:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:01:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7002.magru.wmnet with reason: host reimage [08:02:13] abijeet: around? [08:02:44] hello [08:04:19] abijeet: can these patches be tested separately, or do they depend on each other? [08:04:21] (03PS1) 10Muehlenhoff: Add small helper script for checking firewall config for nftables and ferm [puppet] - 10https://gerrit.wikimedia.org/r/1098465 [08:04:36] kart_, they can be tested separately [08:05:01] I'll start with first patch and once we start deployment, will +2 on 2nd patch. [08:05:08] ok [08:05:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy2002 using scap backport" [extensions/TranslationNotifications] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1098415 (https://phabricator.wikimedia.org/T380724) (owner: 10Abijeet Patro) [08:05:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:07:31] PROBLEM - Disk space on thanos-be1002 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sde1 174128 MB (4% inode=91%): /srv/swift-storage/sdc1 148931 MB (3% inode=90%): /srv/swift-storage/sdf1 178075 MB (4% inode=91%): /srv/swift-storage/sdd1 180759 MB (4% inode=91%): /srv/swift-storage/sdg1 188674 MB (4% inode=91%): /srv/swift-storage/sdh1 176265 MB (4% inode=92%): /srv/swift-storage/sdi1 208144 MB (5% inode=92%): /srv/swift-st [08:07:31] j1 171786 MB 
(4% inode=92%): /srv/swift-storage/sdk1 170758 MB (4% inode=92%): /srv/swift-storage/sdm1 170428 MB (4% inode=92%): /srv/swift-storage/sdn1 180469 MB (4% inode=92%): /srv/swift-storage/sdl1 159027 MB (4% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1002&var-datasource=eqiad+prometheus/ops [08:09:39] PROBLEM - Disk space on thanos-be1003 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdh1 171269 MB (4% inode=92%): /srv/swift-storage/sdc1 170580 MB (4% inode=91%): /srv/swift-storage/sdf1 201463 MB (5% inode=92%): /srv/swift-storage/sdg1 194113 MB (5% inode=92%): /srv/swift-storage/sdd1 171242 MB (4% inode=91%): /srv/swift-storage/sde1 185978 MB (4% inode=92%): /srv/swift-storage/sdi1 177540 MB (4% inode=91%): /srv/swift-st [08:09:39] k1 175534 MB (4% inode=92%): /srv/swift-storage/sdj1 168947 MB (4% inode=91%): /srv/swift-storage/sdl1 170915 MB (4% inode=92%): /srv/swift-storage/sdm1 180730 MB (4% inode=91%): /srv/swift-storage/sdn1 146863 MB (3% inode=90%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops [08:09:54] (03CR) 10Muehlenhoff: [C:03+2] Add small helper script for checking firewall config for nftables and ferm [puppet] - 10https://gerrit.wikimedia.org/r/1098465 (owner: 10Muehlenhoff) [08:14:19] PROBLEM - Disk space on thanos-be1001 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdf1 192288 MB (5% inode=92%): /srv/swift-storage/sdg1 182702 MB (4% inode=91%): /srv/swift-storage/sdc1 186648 MB (4% inode=92%): /srv/swift-storage/sdi1 163047 MB (4% inode=91%): /srv/swift-storage/sde1 158899 MB (4% inode=91%): /srv/swift-storage/sdh1 168798 MB (4% inode=91%): /srv/swift-storage/sdj1 205095 MB (5% inode=92%): /srv/swift-st [08:14:19] k1 193241 MB (5% inode=92%): /srv/swift-storage/sdd1 150114 MB (3% inode=90%): 
/srv/swift-storage/sdm1 176126 MB (4% inode=92%): /srv/swift-storage/sdl1 163831 MB (4% inode=91%): /srv/swift-storage/sdn1 149088 MB (3% inode=90%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1001&var-datasource=eqiad+prometheus/ops [08:15:56] (03Merged) 10jenkins-bot: Fix illegal access of typed property. [extensions/TranslationNotifications] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1098415 (https://phabricator.wikimedia.org/T380724) (owner: 10Abijeet Patro) [08:17:03] (03PS1) 10Muehlenhoff: ganeti: Add missing file [puppet] - 10https://gerrit.wikimedia.org/r/1098467 [08:17:16] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1098415|Fix illegal access of typed property. (T380724)]] [08:17:21] T380724: Error: Typed property MediaWiki\Extension\TranslationNotifications\Jobs\GenericTranslationNotificationsJob::$logger must not be accessed before initialization - https://phabricator.wikimedia.org/T380724 [08:17:39] (03CR) 10CI reject: [V:04-1] ganeti: Add missing file [puppet] - 10https://gerrit.wikimedia.org/r/1098467 (owner: 10Muehlenhoff) [08:18:32] (03PS2) 10Muehlenhoff: ganeti: Add missing file [puppet] - 10https://gerrit.wikimedia.org/r/1098467 [08:19:26] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [08:22:28] (03Abandoned) 10David Caro: grid: disable hardcoded memory overcmommit on weblight [puppet] - 10https://gerrit.wikimedia.org/r/983139 (owner: 10David Caro) [08:22:39] (03CR) 10Vgutierrez: [C:03+1] benthos: add benthos for haproxy debug functions (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [08:23:25] (03CR) 10Muehlenhoff: [C:03+2] ganeti: Add missing file [puppet] - 10https://gerrit.wikimedia.org/r/1098467 (owner: 10Muehlenhoff) [08:23:45] 
FIRING: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:23:46] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002" [08:23:51] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [08:24:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002" [08:24:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7002.magru.wmnet with OS bookworm [08:24:49] (03CR) 10Vgutierrez: [C:04-1] hiera: add log ring to cp4039 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [08:24:50] !log kartik@deploy2002 kartik, abi: Backport for [[gerrit:1098415|Fix illegal access of typed property. (T380724)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:24:54] T380724: Error: Typed property MediaWiki\Extension\TranslationNotifications\Jobs\GenericTranslationNotificationsJob::$logger must not be accessed before initialization - https://phabricator.wikimedia.org/T380724 [08:25:04] abijeet: Please test. [08:25:22] kart_, ok [08:27:09] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:29:17] abijeet: should we +2 2nd patch? [08:30:52] kart_, looks good, we can proceed [08:31:08] and yea, we can +2 2nd patch [08:31:34] nice! 
[08:31:39] !log kartik@deploy2002 kartik, abi: Continuing with sync [08:32:13] (03CR) 10KartikMistry: [C:03+2] ext.uls.inputsettings: Use arrow functions [extensions/UniversalLanguageSelector] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1098413 (https://phabricator.wikimedia.org/T380431) (owner: 10Abijeet Patro) [08:35:40] (03CR) 10Fabfur: hiera: add log ring to cp4039 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [08:36:09] (03PS5) 10Fabfur: hiera: add log ring to cp4039 [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) [08:37:39] (03PS1) 10Muehlenhoff: sre.ganeti.addnode: Readd firewall check [cookbooks] - 10https://gerrit.wikimedia.org/r/1098468 [08:38:19] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098415|Fix illegal access of typed property. (T380724)]] (duration: 21m 02s) [08:38:23] T380724: Error: Typed property MediaWiki\Extension\TranslationNotifications\Jobs\GenericTranslationNotificationsJob::$logger must not be accessed before initialization - https://phabricator.wikimedia.org/T380724 [08:39:11] abijeet: going with 2nd patch.. 
[08:39:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy2002 using scap backport" [extensions/UniversalLanguageSelector] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1098413 (https://phabricator.wikimedia.org/T380431) (owner: 10Abijeet Patro) [08:42:18] kart_, ok [08:43:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in esams - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [08:45:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm [08:49:08] (03PS8) 10Fabfur: benthos: add benthos for haproxy debug functions [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) [08:49:28] (03CR) 10Fabfur: benthos: add benthos for haproxy debug functions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [08:49:35] (03Merged) 10jenkins-bot: ext.uls.inputsettings: Use arrow functions [extensions/UniversalLanguageSelector] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1098413 (https://phabricator.wikimedia.org/T380431) (owner: 10Abijeet Patro) [08:49:38] (03PS6) 10Fabfur: hiera: add log ring to cp4039 [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) [08:50:02] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1098413|ext.uls.inputsettings: Use arrow functions (T380431)]] [08:50:07] T380431: TypeError: this.markDirty is not a function - https://phabricator.wikimedia.org/T380431 [08:51:27] 06SRE, 06Infrastructure-Foundations, 06serviceops: Clean up the Docker Registry catalog and Swift storage from old images - https://phabricator.wikimedia.org/T375645#10360573 (10elukey) 05Open→03Declined The K8s SIG reviewed this proposal and for the moment it was decided not to 
proceed with anything... [08:55:42] !log kartik@deploy2002 abi, kartik: Backport for [[gerrit:1098413|ext.uls.inputsettings: Use arrow functions (T380431)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:55:47] T380431: TypeError: this.markDirty is not a function - https://phabricator.wikimedia.org/T380431 [08:56:41] abijeet: you can test the patch on mwdebug [08:56:50] kart_, ok [08:57:39] (03PS1) 10Muehlenhoff: Readd Ganeti role to ganeti7001/7002 [puppet] - 10https://gerrit.wikimedia.org/r/1098469 (https://phabricator.wikimedia.org/T376737) [08:57:56] kart_, works fine [08:59:06] cool. going ahead! [08:59:12] !log kartik@deploy2002 abi, kartik: Continuing with sync [08:59:15] 06SRE, 10Incident-Reporting-System (Pilot wiki release December 2024), 10Trust and Safety Product Sprint (Sprint Gong (November 18 - December 6)): Allow Extension:ReportIncident to make POST requests to wikimediats.zendesk.com - https://phabricator.wikimedia.org/T380908#10360585 (10kostajh) AIUI the mechansi... [08:59:45] 07Puppet, 10SRE-swift-storage, 10SRE-tools, 06DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10360586 (10elukey) I tried to dowload and install perccli == `007.2616.0000.0000` on ms-be2081 but no luck, same... [09:00:05] hashar and andre: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T0900) [09:02:18] (03PS1) 10Slyngshede: Show CN as signed in username [software/bitu] - 10https://gerrit.wikimedia.org/r/1098470 (https://phabricator.wikimedia.org/T378344) [09:03:16] hashar, andre I'm finishing deployment, will take few minutes.. 
[09:03:39] o/ [09:03:42] yeah take your time [09:03:52] I am going to brew a coffee and watch the error log [09:04:15] (03CR) 10Vgutierrez: benthos: add benthos for haproxy debug functions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [09:04:22] (03CR) 10DCausse: [C:03+1] wdqs-ldf: Make Data Platform SRE the recipient of the LDF alerts [puppet] - 10https://gerrit.wikimedia.org/r/1097441 (https://phabricator.wikimedia.org/T379182) (owner: 10Bking) [09:05:19] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage [09:06:09] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098413|ext.uls.inputsettings: Use arrow functions (T380431)]] (duration: 16m 06s) [09:06:21] T380431: TypeError: this.markDirty is not a function - https://phabricator.wikimedia.org/T380431 [09:06:48] hashar: done. Have a nice day & :coffee [09:07:07] abijeet: we are done! [09:07:21] that was fast [09:07:25] (kind of) [09:08:50] 07Puppet, 10SRE-swift-storage, 10SRE-tools, 06DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10360601 (10elukey) I think we could easily try to swap perccli with storcli for the hosts with SAS3908 onboard, b... 
[09:09:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage [09:10:46] kart_, thanks [09:13:55] (03PS9) 10Fabfur: benthos: add benthos for haproxy debug functions [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) [09:14:12] (03CR) 10Fabfur: benthos: add benthos for haproxy debug functions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [09:14:20] (03PS7) 10Fabfur: hiera: add log ring to cp4039 [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) [09:19:07] logs look good, I am processing [09:19:17] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [09:19:44] (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098471 (https://phabricator.wikimedia.org/T375664) [09:19:45] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098471 (https://phabricator.wikimedia.org/T375664) (owner: 10TrainBranchBot) [09:20:31] (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098471 (https://phabricator.wikimedia.org/T375664) (owner: 10TrainBranchBot) [09:22:52] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10360605 (10elukey) 05Resolved→03Open @Jhancock.wm hi! We have done a lot of weird tests with these nodes, I think that we should re-run provision for... 
[09:26:53] ok so httpbb failed again [09:29:44] * hashar files a task [09:29:57] 07Puppet, 10SRE-swift-storage, 10SRE-tools, 06DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10360614 (10MoritzMuehlenhoff) >>! In T377853#10360612, @MoritzMuehlenhoff wrote: > There are debs available in th... [09:30:13] 07Puppet, 10SRE-swift-storage, 10SRE-tools, 06DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10360612 (10MoritzMuehlenhoff) There are debs available in the Thomas Krenn repo (German server vendor): https://w... [09:30:29] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be10{83-91} - https://phabricator.wikimedia.org/T371389#10360610 (10elukey) @VRiley-WMF @Jclark-ctr Hi! We are ready to start provisioning these nodes, but the procedure is a little bit more convoluted than th... [09:30:50] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002" [09:31:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002" [09:31:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm [09:36:14] 07Puppet, 10SRE-swift-storage, 10SRE-tools, 06DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10360637 (10MoritzMuehlenhoff) One other option is to try https://github.com/namiltd/megactl with this controller.... 
[09:39:35] 06SRE, 10Deployments, 06serviceops-radar: Confusing failed httpbb check for totoro.wikimedia.org during scap deployment - https://phabricator.wikimedia.org/T364880#10360638 (10hashar) 05Open→03Resolved a:03RLazarus I am marking this one resolved since the confusing https://totoro.wikimedia.org. URL... [09:44:36] (03CR) 10Muehlenhoff: [C:03+2] Readd Ganeti role to ganeti7001/7002 [puppet] - 10https://gerrit.wikimedia.org/r/1098469 (https://phabricator.wikimedia.org/T376737) (owner: 10Muehlenhoff) [09:44:59] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.5 refs T375664 [09:45:10] T375664: 1.44.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T375664 [09:45:38] -15 [09:45:45] nope :D [09:46:41] !log fabfur@cumin1002 START - Cookbook sre.hosts.provision for host cp7006.mgmt.magru.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:48:00] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "readded ganeti nodes in magru - jmm@cumin2002 - T376737" [09:48:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "readded ganeti nodes in magru - jmm@cumin2002 - T376737" [09:49:56] PROBLEM - Host cp7006 is DOWN: PING CRITICAL - Packet loss = 100% [09:55:08] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp7006.mgmt.magru.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [09:56:00] RECOVERY - Host cp7006 is UP: PING OK - Packet loss = 0%, RTA = 114.95 ms [09:57:46] FIRING: Not accepting/receiving prefixes from anycast BGP peer: Alert for device asw1-b4-magru.mgmt.magru.wmnet - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [09:58:27] !log jmm@cumin2002 START - Cookbook 
sre.network.configure-switch-interfaces for host ganeti7001 [09:59:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti7001 [09:59:46] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti7002 [10:00:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti7002 [10:01:16] !log fabfur@cumin1002 START - Cookbook sre.hosts.provision for host cp7008.mgmt.magru.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [10:04:30] PROBLEM - Host cp7008 is DOWN: PING CRITICAL - Packet loss = 100% [10:08:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet [10:09:26] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:10:19] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp7008.mgmt.magru.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [10:10:40] RECOVERY - Host cp7008 is UP: PING OK - Packet loss = 0%, RTA = 115.07 ms [10:14:49] (03CR) 10Ilias Sarantopoulos: [C:03+1] "LGTM1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098414 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira) [10:15:18] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [10:18:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet [10:19:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet [10:21:22] (03CR) 10Muehlenhoff: [C:03+2] sre.ganeti.addnode: 
Readd firewall check [cookbooks] - 10https://gerrit.wikimedia.org/r/1098468 (owner: 10Muehlenhoff) [10:28:30] (03CR) 10Gehel: [C:03+1] wdqs-ldf: Make Data Platform SRE the recipient of the LDF alerts [puppet] - 10https://gerrit.wikimedia.org/r/1097441 (https://phabricator.wikimedia.org/T379182) (owner: 10Bking) [10:29:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet [10:35:54] (03PS1) 10Klausman: ml-lab: Allow users to run nvtop via sudo [puppet] - 10https://gerrit.wikimedia.org/r/1098478 [10:36:42] PROBLEM - snapshot of x1 in eqiad on backupmon1001 is CRITICAL: Last snapshot for x1 at eqiad (db1216) taken on 2024-11-27 10:07:57 is 325 GiB, but the previous one was 469 GiB, a change of -30.6 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [10:36:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:36:49] (03PS2) 10Klausman: ml-lab: Allow users to run nvtop and radeontop via sudo [puppet] - 10https://gerrit.wikimedia.org/r/1098478 [10:39:04] (03CR) 10Kevin Bazira: [C:03+2] ml-services: deploy article-country to the article-models ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098414 (https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira) [10:39:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [10:40:29] (03Merged) 10jenkins-bot: ml-services: deploy article-country to the article-models ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098414 
(https://phabricator.wikimedia.org/T371897) (owner: 10Kevin Bazira) [10:44:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [10:44:40] (03PS1) 10Arnaudb: mariadb: add innodb buffer pool usage monitoring [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) [10:44:40] (03CR) 10Arnaudb: "x and m sections are excluded from this alert" [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) (owner: 10Arnaudb) [10:46:44] (03PS8) 10Fabfur: hiera: add log ring to cp4039 [puppet] - 10https://gerrit.wikimedia.org/r/1097416 (https://phabricator.wikimedia.org/T329332) [10:54:48] (03PS1) 10Máté Szabó: Add HTTP proxy for IRS Zendesk integration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098480 (https://phabricator.wikimedia.org/T380908) [10:56:21] (03CR) 10Kosta Harlan: Add HTTP proxy for IRS Zendesk integration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098480 (https://phabricator.wikimedia.org/T380908) (owner: 10Máté Szabó) [10:57:07] 06SRE, 10Incident-Reporting-System (Pilot wiki release December 2024), 13Patch-For-Review, 10Trust and Safety Product Sprint (Sprint Gong (November 18 - December 6)): Allow Extension:ReportIncident to make POST requests to wikimediats.zendesk.com - https://phabricator.wikimedia.org/T380908#10360873 (10mszab... 
[11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T1100) [11:02:48] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru01 and group B3 [11:03:46] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru01 and group B3 [11:04:29] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [11:05:36] (03PS2) 10Máté Szabó: Configure IRS Zendesk integration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098480 (https://phabricator.wikimedia.org/T380908) [11:05:40] (03CR) 10Máté Szabó: Configure IRS Zendesk integration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098480 (https://phabricator.wikimedia.org/T380908) (owner: 10Máté Szabó) [11:07:06] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:13:09] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on lvs7001.magru.wmnet with reason: T376737 [11:13:23] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs7001.magru.wmnet with reason: T376737 [11:13:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti7002.magru.wmnet to cluster magru02 and group B4 [11:14:51] (03CR) 10Kosta Harlan: Configure IRS Zendesk integration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098480 (https://phabricator.wikimedia.org/T380908) (owner: 10Máté Szabó) [11:15:15] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host 
ganeti7002.magru.wmnet to cluster magru02 and group B4 [11:16:02] !log T380875 Ran mwscript-k8s --comment="T380875" -f -- extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=metawiki --logwiki=metawiki 'EMBakeryEquipment' 'Janapanna' [11:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:10] T380875: Unblock stuck global rename of Janapanna - https://phabricator.wikimedia.org/T380875 [11:16:24] PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:16:57] FIRING: [10x] ProbeDown: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:17:12] !incidents [11:17:12] 5480 (ACKED) PyBalBGPUnstable lvs sre (lvs7003:9090 pybal 64600 10.140.0.1 magru) [11:17:13] 5482 (UNACKED) [10x] ProbeDown sre (probes/service magru) [11:17:13] 5478 (RESOLVED) [10x] ProbeDown sre (probes/service magru) [11:17:13] 5477 (RESOLVED) [10x] ProbeDown sre (probes/service magru) [11:17:13] 5475 (RESOLVED) [7x] ProbeDown sre (probes/service magru) [11:17:14] hello to you to ncredit [11:17:19] too* [11:17:21] !ack 5482 [11:17:21] 5482 (ACKED) [10x] ProbeDown sre (probes/service magru) [11:17:31] ty :) [11:17:49] dowtime it again [11:17:50] fabfur: could we get magru alerts silenced accordingly please? 
[11:17:52] thx [11:18:14] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7002.magru.wmnet to drbd [11:19:15] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on 16 hosts with reason: T376737 [11:19:30] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 16 hosts with reason: T376737 [11:19:43] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on lvs[7001-7003].magru.wmnet with reason: T376737 [11:19:59] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on lvs[7001-7003].magru.wmnet with reason: T376737 [11:20:24] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:20:56] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on dns7001.wikimedia.org with reason: T376737 [11:21:10] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dns7001.wikimedia.org with reason: T376737 [11:21:26] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on dns7002.wikimedia.org with reason: T376737 [11:21:39] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dns7002.wikimedia.org with reason: T376737 [11:21:57] RESOLVED: [10x] ProbeDown: Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:22:06] FIRING: [22x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:26:42] FIRING: JobUnavailable: Reduced availability for job benthos in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:27:32] PROBLEM - Disk space on thanos-be1002 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sde1 175438 MB (4% inode=91%): /srv/swift-storage/sdc1 152201 MB (3% inode=90%): /srv/swift-storage/sdf1 179419 MB (4% inode=91%): /srv/swift-storage/sdd1 181544 MB (4% inode=91%): /srv/swift-storage/sdg1 184842 MB (4% inode=91%): /srv/swift-storage/sdh1 175138 MB (4% inode=92%): /srv/swift-storage/sdi1 203095 MB (5% inode=92%): /srv/swift-st [11:27:32] j1 173760 MB (4% inode=92%): /srv/swift-storage/sdk1 167809 MB (4% inode=92%): /srv/swift-storage/sdm1 169579 MB (4% inode=92%): /srv/swift-storage/sdn1 185368 MB (4% inode=92%): /srv/swift-storage/sdl1 157703 MB (4% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1002&var-datasource=eqiad+prometheus/ops [11:28:17] (03PS1) 10Ladsgroup: Bump ratio of new parsercache key spec to 3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098484 (https://phabricator.wikimedia.org/T373037) [11:29:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7002.magru.wmnet to drbd [11:29:34] PROBLEM - Host ncredir7002 is DOWN: PING CRITICAL - Packet loss = 100% [11:29:38] RECOVERY - Host ncredir7002 is UP: PING OK - Packet loss = 0%, RTA = 115.46 ms [11:30:01] jouncebot: nowandnext [11:30:02] For the next 0 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T1100) [11:30:02] In 0 hour(s) and 29 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T1200) [11:31:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1098484 (https://phabricator.wikimedia.org/T373037) (owner: 10Ladsgroup) [11:31:42] RESOLVED: JobUnavailable: Reduced availability for job benthos in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:31:52] (03Merged) 10jenkins-bot: Bump ratio of new parsercache key spec to 3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098484 (https://phabricator.wikimedia.org/T373037) (owner: 10Ladsgroup) [11:32:20] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1098484|Bump ratio of new parsercache key spec to 3 (T373037)]] [11:32:26] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037 [11:34:37] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of bast7001.wikimedia.org to drbd [11:34:40] (03CR) 10Tiziano Fogli: [C:03+1] "Just a few tips for improved readability." 
[alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) (owner: 10Arnaudb) [11:38:11] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1098484|Bump ratio of new parsercache key spec to 3 (T373037)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:38:15] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037 [11:38:32] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [11:39:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:39:39] (03PS1) 10AOkoth: miscweb: support os-reports deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) [11:41:03] (03PS3) 10Máté Szabó: Configure IRS Zendesk integration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098480 (https://phabricator.wikimedia.org/T380908) [11:41:09] (03CR) 10Máté Szabó: Configure IRS Zendesk integration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098480 (https://phabricator.wikimedia.org/T380908) (owner: 10Máté Szabó) [11:43:07] (03CR) 10Kosta Harlan: Configure IRS Zendesk integration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098480 (https://phabricator.wikimedia.org/T380908) (owner: 10Máté Szabó) [11:45:11] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098484|Bump ratio of new parsercache key spec to 3 (T373037)]] (duration: 12m 51s) [11:45:16] T373037: Make ParserCache more like a ring - https://phabricator.wikimedia.org/T373037 [11:49:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.62s 
- https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:49:19] (03PS1) 10Muehlenhoff: miscweb: support os-reports deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [11:50:22] (03PS4) 10Máté Szabó: Configure IRS Zendesk integration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098480 (https://phabricator.wikimedia.org/T380908) [11:50:29] (03CR) 10Máté Szabó: Configure IRS Zendesk integration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098480 (https://phabricator.wikimedia.org/T380908) (owner: 10Máté Szabó) [11:53:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of bast7001.wikimedia.org to drbd [11:54:14] PROBLEM - Host ganeti2042 is DOWN: PING CRITICAL - Packet loss = 100% [11:54:14] PROBLEM - Host bast7001 is DOWN: PING CRITICAL - Packet loss = 100% [11:54:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.62s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:54:28] RECOVERY - Host bast7001 is UP: PING OK - Packet loss = 0%, RTA = 115.64 ms [11:54:43] (03PS1) 10KartikMistry: Update cxserver to 2024-11-20-121713-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098488 (https://phabricator.wikimedia.org/T377966) [11:54:47] ganeti2042 is expected? 
[11:57:18] (03PS1) 10AOkoth: mailman: run tasks every 24 hours [puppet] - 10https://gerrit.wikimedia.org/r/1098489 (https://phabricator.wikimedia.org/T377045) [11:58:31] (03PS1) 10Mvolz: Update zotero package_lock and translators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098490 (https://phabricator.wikimedia.org/T378460) [11:58:42] RECOVERY - Host ganeti2042 is UP: PING OK - Packet loss = 0%, RTA = 30.35 ms [11:59:41] OK to deploy cxserver? [11:59:43] (03PS1) 10Cathal Mooney: Change IP for lvs7003 on public1-b3-magru to 195.200.68.5/27 [puppet] - 10https://gerrit.wikimedia.org/r/1098491 (https://phabricator.wikimedia.org/T376737) [11:59:54] moritzm: did ganeti2042 just crashed? [12:00:05] mvolz: gettimeofday() says it's time for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T1200) [12:00:10] (03PS2) 10Arnaudb: mariadb: add innodb buffer pool usage monitoring [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) [12:01:00] (03CR) 10Arnaudb: mariadb: add innodb buffer pool usage monitoring (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) (owner: 10Arnaudb) [12:01:11] vgutierrez: expired downtime [12:01:17] I'll extend it [12:01:23] oh ok [12:01:41] the server is freshly procured from Supermicro, but has a broken CPU and DC ops are figuring out the process to get the part replaced [12:01:49] (03PS3) 10Arnaudb: mariadb: add innodb buffer pool usage monitoring [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) [12:01:56] kart_: yep [12:03:02] (03CR) 10CI reject: [V:04-1] mariadb: add innodb buffer pool usage monitoring [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) (owner: 10Arnaudb) [12:03:21] (03CR) 10Mvolz: [C:03+2] Update zotero package_lock and translators [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1098490 (https://phabricator.wikimedia.org/T378460) (owner: 10Mvolz) [12:03:26] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on ganeti2042.codfw.wmnet with reason: broken CPU [12:03:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on ganeti2042.codfw.wmnet with reason: broken CPU [12:04:23] (03Merged) 10jenkins-bot: Update zotero package_lock and translators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098490 (https://phabricator.wikimedia.org/T378460) (owner: 10Mvolz) [12:04:56] (03CR) 10Muehlenhoff: [C:03+1] "Patch looks good to me. The established process for changing sudo permissions to have them discussed in the weekly SRE IF meeting, I've ad" [puppet] - 10https://gerrit.wikimedia.org/r/1098478 (owner: 10Klausman) [12:05:00] claime: Thanks [12:05:07] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [12:05:10] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [12:05:24] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/zotero: apply [12:05:27] ah. 
Forgot to merge the patch ;) [12:05:45] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/zotero: apply [12:06:12] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/zotero: apply [12:06:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of prometheus7001.magru.wmnet to drbd [12:06:24] (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2024-11-20-121713-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098488 (https://phabricator.wikimedia.org/T377966) (owner: 10KartikMistry) [12:06:44] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [12:07:02] (03PS17) 10Hnowlan: mediawiki: add mercurius features [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) [12:07:33] (03Merged) 10jenkins-bot: Update cxserver to 2024-11-20-121713-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098488 (https://phabricator.wikimedia.org/T377966) (owner: 10KartikMistry) [12:07:42] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/zotero: apply [12:07:46] (03CR) 10Hnowlan: mediawiki: add mercurius features (038 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [12:08:20] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [12:09:41] (03PS4) 10Arnaudb: mariadb: add innodb buffer pool usage monitoring [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) [12:11:02] (03CR) 10Arnaudb: "I've kept the annotations on the critical threshold, they have different summaries (90% vs 99%). Please lmk if its not ok!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) (owner: 10Arnaudb) [12:11:50] (03PS5) 10Arnaudb: mariadb: add innodb buffer pool usage monitoring [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) [12:12:19] !log installing openssl security updates [12:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:53] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098111 (owner: 10PipelineBot) [12:13:39] (03PS1) 10KartikMistry: Update recommendation-api to 2024-11-27-065850-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098495 (https://phabricator.wikimedia.org/T380838) [12:13:45] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [12:13:56] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098111 (owner: 10PipelineBot) [12:13:56] (03CR) 10Effie Mouzeli: "Since we will be using the mesh, shall we enable it in the fixtures?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [12:14:09] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [12:16:03] (03CR) 10Clément Goubert: "I did, then forgot to upload the PS. Incoming." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [12:18:46] !log installing python-cryptography security updates [12:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:07] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [12:20:35] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [12:22:19] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [12:22:53] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [12:24:17] !log Updated cxserver to 2024-11-20-121713-production (T377966, T357950) [12:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:22] T377966: cxserver: Logstash entries seems difficult to read - https://phabricator.wikimedia.org/T377966 [12:24:22] T357950: Remove servicerunner dependency for cxserver - https://phabricator.wikimedia.org/T357950 [12:24:27] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [12:24:51] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [12:26:06] (03CR) 10Ssingh: [C:03+1] "nice and thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1098491 (https://phabricator.wikimedia.org/T376737) (owner: 10Cathal Mooney) [12:26:24] (03CR) 10Marostegui: "This needs more thinking, especially the alerting. Please do not merge yet." 
[alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) (owner: 10Arnaudb) [12:26:38] PROBLEM - Host prometheus7001 is DOWN: PING CRITICAL - Packet loss = 100% [12:29:06] (03PS7) 10Clément Goubert: mediawiki: Add mwcron feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) [12:31:21] (03PS3) 10NMW03: Updated wordmark for Azerbaijani Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (https://phabricator.wikimedia.org/T380974) [12:31:47] (03PS6) 10Arnaudb: mariadb: add innodb buffer pool usage monitoring [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) [12:32:01] (03CR) 10Arnaudb: "- I've established a more descriptive baseline" [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) (owner: 10Arnaudb) [12:32:30] (03CR) 10Anzx: [C:03+1] Updated wordmark for Azerbaijani Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (https://phabricator.wikimedia.org/T380974) (owner: 10NMW03) [12:32:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 27 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (https://phabricator.wikimedia.org/T380974) (owner: 10NMW03) [12:34:18] (03CR) 10Anzx: [C:04-1] "there are some unrelated changes to azwikiqoute, you need to fix it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (https://phabricator.wikimedia.org/T380974) (owner: 10NMW03) [12:34:20] (03CR) 10Arnaudb: "* I've established a more descriptive baseline → misphrasing went through:" [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) (owner: 10Arnaudb) [12:35:38] (03CR) 10Marostegui: "We should make this a conditional and only start sending warnings (I don't think we need 
a critical) if the Uptime is higher than XX (to b" [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) (owner: 10Arnaudb) [12:35:53] (03CR) 10Cathal Mooney: [C:03+2] Change IP for lvs7003 on public1-b3-magru to 195.200.68.5/27 [puppet] - 10https://gerrit.wikimedia.org/r/1098491 (https://phabricator.wikimedia.org/T376737) (owner: 10Cathal Mooney) [12:36:05] (03CR) 10KartikMistry: [C:03+2] Update recommendation-api to 2024-11-27-065850-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098495 (https://phabricator.wikimedia.org/T380838) (owner: 10KartikMistry) [12:36:38] (03CR) 10Arnaudb: "good idea, I'll go in that direction, transitioning to WIP in the meantime" [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) (owner: 10Arnaudb) [12:37:08] (03Merged) 10jenkins-bot: Update recommendation-api to 2024-11-27-065850-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098495 (https://phabricator.wikimedia.org/T380838) (owner: 10KartikMistry) [12:38:13] !log start replacing kafka-main1002 with kafka-main1007 - T363214 [12:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:17] T363214: kafka-main100[6789] and kafka-main1010 implementation tracking - https://phabricator.wikimedia.org/T363214 [12:39:11] !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[12:47:34] (03PS1) 10Arturo Borrero Gonzalez: openstack: neutron-openvswitch-agent: prevent puppet from restarting the service [puppet] - 10https://gerrit.wikimedia.org/r/1098498 (https://phabricator.wikimedia.org/T380972) [12:48:13] (03CR) 10CI reject: [V:04-1] openstack: neutron-openvswitch-agent: prevent puppet from restarting the service [puppet] - 10https://gerrit.wikimedia.org/r/1098498 (https://phabricator.wikimedia.org/T380972) (owner: 10Arturo Borrero Gonzalez) [12:48:41] (03PS2) 10Arturo Borrero Gonzalez: openstack: neutron-openvswitch-agent: prevent puppet from restarting the service [puppet] - 10https://gerrit.wikimedia.org/r/1098498 (https://phabricator.wikimedia.org/T380972) [12:49:44] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1098498 (https://phabricator.wikimedia.org/T380972) (owner: 10Arturo Borrero Gonzalez) [12:49:49] (03PS1) 10Hnowlan: jobqueue: disable webVideoTranscodePrioritized [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098499 (https://phabricator.wikimedia.org/T371701) [12:50:21] !log installing ghostscript security updates [12:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:31] (03PS18) 10Hnowlan: mediawiki: add mercurius features [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) [12:56:32] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kafka-main[1002,1007].eqiad.wmnet with reason: Hardware refresh [12:56:36] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kafka-main[1002,1007].eqiad.wmnet with reason: Hardware refresh [12:57:31] RESOLVED: Not accepting/receiving prefixes from anycast BGP peer: Device asw1-b4-magru.mgmt.magru.wmnet recovered from Not accepting/receiving prefixes from anycast BGP peer - 
https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [13:01:56] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1098498 (https://phabricator.wikimedia.org/T380972) (owner: 10Arturo Borrero Gonzalez) [13:03:58] jouncebot: nowandnext [13:03:58] No deployments scheduled for the next 0 hour(s) and 56 minute(s) [13:03:58] In 0 hour(s) and 56 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T1400) [13:04:18] mszabo and I will deploy some operations/mediawiki-config changes, unless anyone objects [13:05:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of prometheus7001.magru.wmnet to drbd [13:05:24] RECOVERY - Host prometheus7001 is UP: PING OK - Packet loss = 0%, RTA = 115.41 ms [13:05:37] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh7002.wikimedia.org to drbd [13:06:14] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: neutron-openvswitch-agent: prevent puppet from restarting the service [puppet] - 10https://gerrit.wikimedia.org/r/1098498 (https://phabricator.wikimedia.org/T380972) (owner: 10Arturo Borrero Gonzalez) [13:08:22] PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:08:50] PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:09:45] (03PS1) 10KartikMistry: Fix LANGUAGE_PAIRS_API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098504 [13:09:46] Does anyone know if you need to update the chart version even if the chart doesn't change, to pull through a config.prod.yaml change in the build itself? 
[13:10:22] Had a deploy not work and wondering if it's because I only updated the build and not the chart? [13:12:36] (03CR) 10JMeybohm: "Sorry if I'm being too picky here. I feel like this increases the complexity of the chart by quite a bit (it has to, I'm aware) and I had " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [13:13:26] (03CR) 10KartikMistry: [C:03+2] Fix LANGUAGE_PAIRS_API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098504 (owner: 10KartikMistry) [13:13:42] FIRING: JobUnavailable: Reduced availability for job wikidough in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:14:34] (03Merged) 10jenkins-bot: Fix LANGUAGE_PAIRS_API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098504 (owner: 10KartikMistry) [13:14:38] (03CR) 10Mvolz: [C:03+2] "@akosiaris@wikimedia.org - this change had a change to config.dev.yaml... does the chart need to be incremented to pull this through?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098111 (owner: 10PipelineBot) [13:15:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7002.wikimedia.org to drbd [13:15:38] (03CR) 10Mvolz: [C:03+2] "* and config.prod.yaml, the relevant one :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098111 (owner: 10PipelineBot) [13:15:40] !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[13:15:48] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1098470 (https://phabricator.wikimedia.org/T378344) (owner: 10Slyngshede) [13:15:50] PROBLEM - Host doh7002 is DOWN: PING CRITICAL - Packet loss = 100% [13:16:00] RECOVERY - Host doh7002 is UP: PING OK - Packet loss = 0%, RTA = 115.54 ms [13:16:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum7002.magru.wmnet to drbd [13:17:04] PROBLEM - Bird Internet Routing Daemon on doh7002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:17:05] PROBLEM - Check if anycast-healthchecker and all configured threads are running on doh7002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:17:38] (03CR) 10NMW03: "The script automatically did that, it made changes to the array order and the sizes of some logos (which are expected to be done automatic" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (https://phabricator.wikimedia.org/T380974) (owner: 10NMW03) [13:18:02] (03CR) 10Kosta Harlan: [C:03+1] Configure IRS Zendesk integration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098480 (https://phabricator.wikimedia.org/T380908) (owner: 10Máté Szabó) [13:18:36] (03PS1) 10Máté Szabó: private: Add stub for wgReportIncidentZendeskSubjectLine [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098506 (https://phabricator.wikimedia.org/T380868) [13:18:42] RESOLVED: JobUnavailable: Reduced availability for job wikidough in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:18:49] (03CR) 10Kosta 
Harlan: [C:03+1] private: Add stub for wgReportIncidentZendeskSubjectLine [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098506 (https://phabricator.wikimedia.org/T380868) (owner: 10Máté Szabó) [13:19:04] RECOVERY - Bird Internet Routing Daemon on doh7002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:19:04] RECOVERY - Check if anycast-healthchecker and all configured threads are running on doh7002 is OK: OK: UP (pid=2379) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:20:50] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wdqs1027.eqiad.wmnet with OS bullseye [13:20:57] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye [13:20:58] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wdqs1026.eqiad.wmnet with OS bullseye [13:22:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098506 (https://phabricator.wikimedia.org/T380868) (owner: 10Máté Szabó) [13:23:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098480 (https://phabricator.wikimedia.org/T380908) (owner: 10Máté Szabó) [13:23:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093389 (https://phabricator.wikimedia.org/T372823) (owner: 10Máté Szabó) [13:23:41] (03Merged) 10jenkins-bot: private: Add stub for wgReportIncidentZendeskSubjectLine [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098506 (https://phabricator.wikimedia.org/T380868) (owner: 10Máté Szabó) [13:23:45] (03Merged) 10jenkins-bot: Configure IRS Zendesk integration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098480 
(https://phabricator.wikimedia.org/T380908) (owner: 10Máté Szabó) [13:23:47] (03Merged) 10jenkins-bot: Configure instrument for the Incident Reporting System [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093389 (https://phabricator.wikimedia.org/T372823) (owner: 10Máté Szabó) [13:24:15] !log mszabo@deploy2002 Started scap sync-world: Backport for [[gerrit:1098506|private: Add stub for wgReportIncidentZendeskSubjectLine (T380868)]], [[gerrit:1098480|Configure IRS Zendesk integration (T380908)]], [[gerrit:1093389|Configure instrument for the Incident Reporting System (T372823)]] [13:24:23] T380868: Use the Zendesk API for creating tickets for emergency workflow - https://phabricator.wikimedia.org/T380868 [13:24:23] T380908: Allow Extension:ReportIncident to make POST requests to wikimediats.zendesk.com - https://phabricator.wikimedia.org/T380908 [13:24:23] T372823: Instrumentation for Incident Reporting System - https://phabricator.wikimedia.org/T372823 [13:25:55] (03CR) 10Anzx: [C:04-1] "some of image width are odd in unrelated changes other than azwikiquote" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (https://phabricator.wikimedia.org/T380974) (owner: 10NMW03) [13:26:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7002.magru.wmnet to drbd [13:26:38] PROBLEM - Host durum7002 is DOWN: PING CRITICAL - Packet loss = 100% [13:27:32] RECOVERY - Host durum7002 is UP: PING OK - Packet loss = 0%, RTA = 115.56 ms [13:27:57] !log rebalance magru02 following switch of VMs back to DRBD T376737 [13:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:46] PROBLEM - Bird Internet Routing Daemon on durum7002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:28:52] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet 
to drbd [13:29:08] PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum7002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:30:06] !log mszabo@deploy2002 mszabo: Backport for [[gerrit:1098506|private: Add stub for wgReportIncidentZendeskSubjectLine (T380868)]], [[gerrit:1098480|Configure IRS Zendesk integration (T380908)]], [[gerrit:1093389|Configure instrument for the Incident Reporting System (T372823)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:30:08] RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum7002 is OK: OK: UP (pid=2431) and all threads (8) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [13:30:18] T380868: Use the Zendesk API for creating tickets for emergency workflow - https://phabricator.wikimedia.org/T380868 [13:30:18] T380908: Allow Extension:ReportIncident to make POST requests to wikimediats.zendesk.com - https://phabricator.wikimedia.org/T380908 [13:30:19] T372823: Instrumentation for Incident Reporting System - https://phabricator.wikimedia.org/T372823 [13:30:30] RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:30:46] RECOVERY - Bird Internet Routing Daemon on durum7002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [13:30:50] RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:31:19] !log mszabo@deploy2002 mszabo: Continuing with sync [13:35:42] FIRING: JobUnavailable: Reduced availability for job gnmic in ops@magru - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:37:23] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10361480 (10MoritzMuehlenhoff) [13:38:08] !log mszabo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098506|private: Add stub for wgReportIncidentZendeskSubjectLine (T380868)]], [[gerrit:1098480|Configure IRS Zendesk integration (T380908)]], [[gerrit:1093389|Configure instrument for the Incident Reporting System (T372823)]] (duration: 13m 53s) [13:38:19] T380868: Use the Zendesk API for creating tickets for emergency workflow - https://phabricator.wikimedia.org/T380868 [13:38:20] T380908: Allow Extension:ReportIncident to make POST requests to wikimediats.zendesk.com - https://phabricator.wikimedia.org/T380908 [13:38:22] T372823: Instrumentation for Incident Reporting System - https://phabricator.wikimedia.org/T372823 [13:39:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to drbd [13:39:48] PROBLEM - Host netflow7001 is DOWN: PING CRITICAL - Packet loss = 100% [13:40:42] FIRING: [3x] JobUnavailable: Reduced availability for job fastnetmon in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:40:42] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to drbd [13:40:48] RECOVERY - Host netflow7001 is UP: PING OK - Packet loss = 0%, RTA = 115.78 ms [13:41:30] (03PS7) 10Arnaudb: mariadb: add innodb buffer pool usage monitoring [alerts] - 10https://gerrit.wikimedia.org/r/1098479 
(https://phabricator.wikimedia.org/T375589) [13:41:30] (03CR) 10Arnaudb: "XX has been set to 3600s (1hr) → please let me know if its not properly adjusted." [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) (owner: 10Arnaudb) [13:41:52] (03PS4) 10NMW03: Updated wordmark for Azerbaijani Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (https://phabricator.wikimedia.org/T380974) [13:41:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10361494 (10aborrero) the server has been drained and is ready for a reboot when you need it. [13:42:10] backport looks okay [13:42:54] (03CR) 10Arnaudb: mariadb: add innodb buffer pool usage monitoring (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1098479 (https://phabricator.wikimedia.org/T375589) (owner: 10Arnaudb) [13:43:07] (03CR) 10NMW03: "Manually removed that part. Again, that was not me who changed that and I think it is quite normal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (https://phabricator.wikimedia.org/T380974) (owner: 10NMW03) [13:43:32] (03CR) 10Anzx: [C:03+1] Updated wordmark for Azerbaijani Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (https://phabricator.wikimedia.org/T380974) (owner: 10NMW03) [13:44:21] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1097378 (owner: 10Slyngshede) [13:45:42] RESOLVED: [3x] JobUnavailable: Reduced availability for job fastnetmon in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:45:54] !log installing php8.2 security updates [13:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:12] FIRING: [2x] JobUnavailable: Reduced availability for job fastnetmon in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:50:57] FIRING: [4x] JobUnavailable: Reduced availability for job fastnetmon in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:53:25] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1097336 (owner: 10Slyngshede) [13:54:39] jouncebot: next [13:54:39] In 0 hour(s) and 5 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T1400) [13:55:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to drbd [13:56:14] PROBLEM - Host install7001 is DOWN: PING CRITICAL - Packet loss = 100% [13:56:24] RECOVERY - Host install7001 is UP: PING OK - Packet loss = 0%, RTA = 115.63 ms [13:57:10] FIRING: [14x] ProbeDown: Service install7001:8080 has failed probes (http_squid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:57:19] 
RESOLVED: [4x] JobUnavailable: Reduced availability for job fastnetmon in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:59:00] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to drbd [13:59:11] FIRING: [14x] ProbeDown: Service install7001:8080 has failed probes (http_squid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T1400). [14:00:04] cscott and Nemoralis: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:14] o/ [14:00:14] i can deploy today [14:00:16] o/ [14:00:20] o/ [14:00:36] hi martin o/ thanks [14:00:50] (03PS2) 10C. Scott Ananian: Enable ParserMigration compact indicator on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098076 (https://phabricator.wikimedia.org/T363484) [14:00:52] (03CR) 10Urbanecm: [C:03+2] Enable ParserMigration compact indicator on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098076 (https://phabricator.wikimedia.org/T363484) (owner: 10C. Scott Ananian) [14:01:13] (03PS3) 10C. 
Scott Ananian: Deploy Parsoid Read Views to de/ru wikivoyage and dagwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093405 (https://phabricator.wikimedia.org/T375394) [14:01:16] (03CR) 10Urbanecm: [C:03+2] Deploy Parsoid Read Views to de/ru wikivoyage and dagwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093405 (https://phabricator.wikimedia.org/T375394) (owner: 10C. Scott Ananian) [14:01:25] (03PS1) 10Jgreen: Records for analytics*.frdev for consistency and new service. [dns] - 10https://gerrit.wikimedia.org/r/1098508 (https://phabricator.wikimedia.org/T377363) [14:01:42] (03PS5) 10NMW03: Updated wordmark for Azerbaijani Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (https://phabricator.wikimedia.org/T380974) [14:01:45] (03CR) 10Urbanecm: [C:03+2] Updated wordmark for Azerbaijani Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (https://phabricator.wikimedia.org/T380974) (owner: 10NMW03) [14:02:02] (03Merged) 10jenkins-bot: Enable ParserMigration compact indicator on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098076 (https://phabricator.wikimedia.org/T363484) (owner: 10C. Scott Ananian) [14:02:07] (03Merged) 10jenkins-bot: Deploy Parsoid Read Views to de/ru wikivoyage and dagwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093405 (https://phabricator.wikimedia.org/T375394) (owner: 10C. 
Scott Ananian) [14:02:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (https://phabricator.wikimedia.org/T380974) (owner: 10NMW03) [14:02:32] (03Merged) 10jenkins-bot: Updated wordmark for Azerbaijani Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098019 (https://phabricator.wikimedia.org/T380974) (owner: 10NMW03) [14:03:01] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1098076|Enable ParserMigration compact indicator on all wikis (T363484)]], [[gerrit:1093405|Deploy Parsoid Read Views to de/ru wikivoyage and dagwiki (T375394 T380401)]], [[gerrit:1098019|Updated wordmark for Azerbaijani Wikiquote (T380974)]] [14:03:10] T363484: Update ParserMigration notice - https://phabricator.wikimedia.org/T363484 [14:03:10] T375394: Deploy Parsoid Read Views to de/ru wikivoyage (week of 2024-11-25) - https://phabricator.wikimedia.org/T375394 [14:03:10] T380401: Deploy Parsoid Read Views to dagwiki (week of 2024-11-25) - https://phabricator.wikimedia.org/T380401 [14:03:11] T380974: Update azwikiquote wordmark - https://phabricator.wikimedia.org/T380974 [14:03:29] (03PS3) 10Urbanecm: [GrowthExperiments] Undefine wgGEDatabaseCluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097309 (https://phabricator.wikimedia.org/T354939) [14:03:33] (03CR) 10Urbanecm: [GrowthExperiments] Undefine wgGEDatabaseCluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097309 (https://phabricator.wikimedia.org/T354939) (owner: 10Urbanecm) [14:04:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:04:12] (03PS1) 
10Slyngshede: Migrate UI customizations to a theme [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1098510 (https://phabricator.wikimedia.org/T380172) [14:04:53] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, one comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/1097935 (https://phabricator.wikimedia.org/T380402) (owner: 10Slyngshede) [14:05:12] FIRING: [2x] JobUnavailable: Reduced availability for job benthos in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:05:34] (03PS2) 10Abijeet Patro: Enable message group subscription feature for some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098509 (https://phabricator.wikimedia.org/T372386) [14:06:29] (03PS1) 10Elukey: modules: add mesh.configuration 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098511 [14:06:29] (03PS1) 10Elukey: modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647) [14:08:49] !log urbanecm@deploy2002 urbanecm, cscott, nmw03: Backport for [[gerrit:1098076|Enable ParserMigration compact indicator on all wikis (T363484)]], [[gerrit:1093405|Deploy Parsoid Read Views to de/ru wikivoyage and dagwiki (T375394 T380401)]], [[gerrit:1098019|Updated wordmark for Azerbaijani Wikiquote (T380974)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:08:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to drbd [14:08:54] PROBLEM - Host ncredir7001 is DOWN: PING CRITICAL - Packet loss = 100% [14:08:57] T363484: Update ParserMigration notice - https://phabricator.wikimedia.org/T363484 [14:08:57] T375394: Deploy Parsoid Read Views to de/ru wikivoyage (week of 2024-11-25) - 
https://phabricator.wikimedia.org/T375394 [14:08:57] T380401: Deploy Parsoid Read Views to dagwiki (week of 2024-11-25) - https://phabricator.wikimedia.org/T380401 [14:08:58] T380974: Update azwikiquote wordmark - https://phabricator.wikimedia.org/T380974 [14:09:03] cscott: Nemoralis: can you test your patches, please? [14:09:09] oh it [14:09:14] RECOVERY - Host ncredir7001 is UP: PING OK - Packet loss = 0%, RTA = 115.68 ms [14:09:26] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:09:30] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to drbd [14:10:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job benthos in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:10:16] urbanecm: LGTM [14:10:23] ty [14:11:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10361606 (10Jclark-ctr) Dell rejected parts request opening new ticket with them 201666996 [14:12:14] (03PS2) 10Muehlenhoff: debmonitor: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1093350 [14:12:28] PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:12:30] PROBLEM - BFD status on asw1-b3-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:13:32] urbanecm: both of my patches look good to me on canaries [14:13:41] 
great! proceeding [14:13:44] !log urbanecm@deploy2002 urbanecm, cscott, nmw03: Continuing with sync [14:14:03] (03CR) 10Urbanecm: [C:03+2] [GrowthExperiments] Undefine wgGEDatabaseCluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097309 (https://phabricator.wikimedia.org/T354939) (owner: 10Urbanecm) [14:14:47] (03Merged) 10jenkins-bot: [GrowthExperiments] Undefine wgGEDatabaseCluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1097309 (https://phabricator.wikimedia.org/T354939) (owner: 10Urbanecm) [14:15:54] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-11-19-140330 to 2024-11-27-074306 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098518 (https://phabricator.wikimedia.org/T139010) [14:16:03] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-11-19-132736 to 2024-11-26-193226 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098519 (https://phabricator.wikimedia.org/T139010) [14:17:43] (03CR) 10Jelto: "I left some comments in-line. Keep in mind to also bump the `version` in `Charts.yaml`." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [14:19:11] (03PS2) 10Muehlenhoff: Add umbrella Cumin alias for wikikube staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/1092776 [14:19:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to drbd [14:20:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh7001.wikimedia.org to drbd [14:20:06] PROBLEM - Host durum7001 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:22] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098076|Enable ParserMigration compact indicator on all wikis (T363484)]], [[gerrit:1093405|Deploy Parsoid Read Views to de/ru wikivoyage and dagwiki (T375394 T380401)]], [[gerrit:1098019|Updated wordmark for Azerbaijani Wikiquote (T380974)]] (duration: 17m 20s) [14:20:28] RECOVERY - Host durum7001 is UP: PING OK - Packet loss = 0%, RTA = 115.64 ms [14:20:29] T363484: Update ParserMigration notice - https://phabricator.wikimedia.org/T363484 [14:20:29] T375394: Deploy Parsoid Read Views to de/ru wikivoyage (week of 2024-11-25) - https://phabricator.wikimedia.org/T375394 [14:20:30] T380401: Deploy Parsoid Read Views to dagwiki (week of 2024-11-25) - https://phabricator.wikimedia.org/T380401 [14:20:30] T380974: Update azwikiquote wordmark - https://phabricator.wikimedia.org/T380974 [14:20:30] Nemoralis: cscott: pushed to prod! [14:20:36] anything else? [14:20:38] urbanecm: thank you! 
[14:20:42] any time [14:20:42] (03CR) 10Clément Goubert: [C:03+1] Add umbrella Cumin alias for wikikube staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/1092776 (owner: 10Muehlenhoff) [14:20:59] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1097309|[GrowthExperiments] Undefine wgGEDatabaseCluster (T354939)]] [14:21:04] T354939: Migrate GrowthExperiments to virtual domains - https://phabricator.wikimedia.org/T354939 [14:21:18] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wdqs1026.eqiad.wmnet with OS bullseye [14:21:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10361675 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wdqs1026.eqiad.wmnet with OS bullseye [14:21:31] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wdqs1027.eqiad.wmnet with OS bullseye [14:21:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10361676 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wdqs1027.eqiad.wmnet with OS bullseye [14:22:06] PROBLEM - Bird Internet Routing Daemon on durum7001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:22:06] PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum7001 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [14:22:42] urbanecm: thanks [14:23:49] urbanecm: is there any cache for the logos? 
[14:24:00] thanks for the reminder, let me purge [14:24:06] RECOVERY - Bird Internet Routing Daemon on durum7001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:24:06] RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum7001 is OK: OK: UP (pid=2388) and all threads (8) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [14:24:09] it works fine in test server, but prod is still old [14:24:42] !log Purge https://en.wikipedia.org/static/images/mobile/copyright/wikiquote-wordmark-az.svg (T380974) [14:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:47] Nemoralis: what about now? [14:25:00] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on cloudvirt1061.eqiad.wmnet with reason: cloudvirt1061 needs maintenance T380673 [14:25:04] T380673: Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673 [14:25:09] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10361683 (10Jclark-ctr) Finished with bios update waiting on dell for response for new ticket [14:25:13] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on cloudvirt1061.eqiad.wmnet with reason: cloudvirt1061 needs maintenance T380673 [14:25:16] urbanecm: nice, thanks! 
[14:26:27] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1097309|[GrowthExperiments] Undefine wgGEDatabaseCluster (T354939)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:26:32] T354939: Migrate GrowthExperiments to virtual domains - https://phabricator.wikimedia.org/T354939 [14:26:35] !log urbanecm@deploy2002 urbanecm: Continuing with sync [14:27:42] FIRING: JobUnavailable: Reduced availability for job wikidough in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:29:02] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Kernel error Server cloudvirt1061 may have kernel errors - https://phabricator.wikimedia.org/T380673#10361703 (10fnegri) 05Open→03In progress [14:30:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to drbd [14:31:48] PROBLEM - Check if anycast-healthchecker and all configured threads are running on doh7001 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [14:32:42] RESOLVED: JobUnavailable: Reduced availability for job wikidough in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:33:04] PROBLEM - Bird Internet Routing Daemon on doh7001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:33:13] ^ downtiming and will check [14:33:21] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1097309|[GrowthExperiments] Undefine 
wgGEDatabaseCluster (T354939)]] (duration: 12m 21s) [14:33:22] probably related to the work moritzm is doing [14:33:25] T354939: Migrate GrowthExperiments to virtual domains - https://phabricator.wikimedia.org/T354939 [14:33:33] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on doh[7001-7002].wikimedia.org with reason: site is depooled, maintenance [14:33:48] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on doh[7001-7002].wikimedia.org with reason: site is depooled, maintenance [14:34:04] RECOVERY - Bird Internet Routing Daemon on doh7001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:34:28] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:34:30] RECOVERY - BFD status on asw1-b3-magru.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:34:48] RECOVERY - Check if anycast-healthchecker and all configured threads are running on doh7001 is OK: OK: UP (pid=2388) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [14:35:28] sukhe: yeah, that was the switch of the VM back to DRBD [14:35:40] !log rebalance magru01 following switch of VMs back to DRBD T376737 [14:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [14:39:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1022.eqiad.wmnet [14:39:45] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052
and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10361778 (10ops-monitoring-bot) Draining ganeti1022.eqiad.wmnet of running VMs [14:41:24] (03PS1) 10KartikMistry: Update recommendation-api to 2024-11-27-142924-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098525 [14:43:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1022.eqiad.wmnet [14:44:08] (03CR) 10KartikMistry: [C:03+2] Update recommendation-api to 2024-11-27-142924-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098525 (owner: 10KartikMistry) [14:45:10] (03Merged) 10jenkins-bot: Update recommendation-api to 2024-11-27-142924-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098525 (owner: 10KartikMistry) [14:47:19] (03PS3) 10Elukey: admin: add Jimmy Ly's account [puppet] - 10https://gerrit.wikimedia.org/r/1098024 (https://phabricator.wikimedia.org/T380525) [14:48:15] !log kartik@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [14:48:28] Deploying rec-api ^^ [14:49:08] (03CR) 10Elukey: [C:04-1] "Still missing the approval for the deployment group, waiting for Tyler's +1 in the task." [puppet] - 10https://gerrit.wikimedia.org/r/1098024 (https://phabricator.wikimedia.org/T380525) (owner: 10Elukey) [14:51:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd1003.eqiad.wmnet to drbd [14:51:54] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment & stats private data access for jly - https://phabricator.wikimedia.org/T380525#10361832 (10elukey) >>! In T380525#10357542, @Jly wrote: > @elukey Got it, I have updated the key now, please see All good thanks! I updated the c... 
[14:52:08] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to deployment & stats private data access for jly - https://phabricator.wikimedia.org/T380525#10361834 (10elukey) [14:52:13] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10361838 (10ops-monitoring-bot) VM ml-etcd1003.eqiad.wmnet switching disk type to drbd [14:52:30] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Sspalding - https://phabricator.wikimedia.org/T380820#10361835 (10elukey) 05Open→03Resolved a:03elukey Closing for the moment, please re-open if needed! [14:54:48] (03CR) 10Nikerabbit: [C:03+1] Enable message group subscription feature for some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098509 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [14:56:35] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for JLy-WMF - https://phabricator.wikimedia.org/T380523#10361858 (10elukey) 05Open→03Resolved a:03elukey ` elukey@mwmaint1002:~$ sudo ldapsearch -x cn=wmf | grep jly member: uid=jly,ou=people,dc=wikimedia,dc=org ` Added! [14:58:10] !log kartik@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [14:59:19] !log kartik@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [15:00:48] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T1500) [15:01:04] (03CR) 10Vgutierrez: "canonical domains are curated in hiera under the key `wikimedia_domains` defined on hieradata/common.yaml, any chance of reusing it?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1092359 (https://phabricator.wikimedia.org/T374640) (owner: 10BCornwall) [15:01:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-etcd1003.eqiad.wmnet to drbd [15:01:08] PROBLEM - Host ml-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [15:01:28] RECOVERY - Host ml-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [15:02:30] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1022.eqiad.wmnet [15:02:43] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10361888 (10ops-monitoring-bot) Draining ganeti1022.eqiad.wmnet of running VMs [15:02:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1022.eqiad.wmnet [15:03:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ml-etcd1003.eqiad.wmnet to plain [15:03:55] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10361893 (10ops-monitoring-bot) VM ml-etcd1003.eqiad.wmnet switching disk type to plain [15:04:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ml-etcd1003.eqiad.wmnet to plain [15:05:23] !log Updated recommendation-api to 2024-11-27-142924-production (T380838, T379036, T380699) [15:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:29] (03PS8) 10Clément Goubert: mediawiki: Add mwcron feature [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) [15:05:30] T380838: recommendation API server fails to fill cache - https://phabricator.wikimedia.org/T380838 [15:05:30] T379036: Update cache in a single thread - 
https://phabricator.wikimedia.org/T379036 [15:05:31] T380699: recommendation-api /api/v1/translation/page-collections throws 500 when cache is empty - https://phabricator.wikimedia.org/T380699 [15:05:42] (03CR) 10Clément Goubert: mediawiki: Add mwcron feature (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [15:05:43] (03PS2) 10Elukey: modules: add mesh.configuration 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098511 [15:05:43] (03PS2) 10Elukey: modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647) [15:05:43] (03PS1) 10Elukey: charts: update tegola-vector-tiles to mesh.configuration:1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098530 [15:05:43] (03PS1) 10Elukey: services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098531 [15:05:52] (03CR) 10Ecarg: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-11-19-140330 to 2024-11-27-074306 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098518 (https://phabricator.wikimedia.org/T139010) (owner: 10Jforrester) [15:06:56] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-11-19-140330 to 2024-11-27-074306 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098518 (https://phabricator.wikimedia.org/T139010) (owner: 10Jforrester) [15:08:15] !log krinkle@webperf2003: `sudo apt-get install kafkacat` (matching webperf1003, for ad-hoc debugging) [15:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:47] !log ecarg@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:09:39] 06SRE, 10LDAP-Access-Requests: Grant Access to NDA-users for ncreasy - https://phabricator.wikimedia.org/T380097#10361923 (10elukey) 05Open→03Resolved 
a:03elukey @NCreasy the wmf group should be enough, you are free to play with DataHub, all perms should be set. If you find any issue please re-open t... [15:09:58] !log ecarg@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:13:03] (03PS19) 10Hnowlan: mediawiki: add mercurius features [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) [15:13:57] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T380487#10361941 (10elukey) Hi! I am looping in @KFrancis since afaics we need to sign an NDA before proceeding. @KFrancis could you please take a look? Thanks in advance :) [15:15:21] (03CR) 10Btullis: [C:03+1] airflow-wmde: stop managing the airflow instance via puppet [puppet] - 10https://gerrit.wikimedia.org/r/1097308 (https://phabricator.wikimedia.org/T380622) (owner: 10Brouberol) [15:15:39] (03CR) 10Brouberol: [C:03+2] airflow-wmde: stop managing the airflow instance via puppet [puppet] - 10https://gerrit.wikimedia.org/r/1097308 (https://phabricator.wikimedia.org/T380622) (owner: 10Brouberol) [15:16:05] (03CR) 10Muehlenhoff: [C:03+2] puppetboard: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1093873 (owner: 10Muehlenhoff) [15:20:21] !log ecarg@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:20:56] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Suzanne Wood (WMDE) - https://phabricator.wikimedia.org/T380994 (10SuzanneWood-WMDE) 03NEW [15:21:10] !log ecarg@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:21:31] (03PS2) 10Elukey: services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098531 [15:22:02] !log ecarg@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:22:54] !log ecarg@deploy2002 helmfile 
[eqiad] DONE helmfile.d/services/wikifunctions: apply [15:25:49] (03CR) 10Ecarg: [C:03+2] wikifunctions: Upgrade evaluators from 2024-11-19-132736 to 2024-11-26-193226 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098519 (https://phabricator.wikimedia.org/T139010) (owner: 10Jforrester) [15:26:58] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2024-11-19-132736 to 2024-11-26-193226 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098519 (https://phabricator.wikimedia.org/T139010) (owner: 10Jforrester) [15:27:36] !log ecarg@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:28:33] !log ecarg@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:28:45] (03PS1) 10Elukey: admin: add sspalding to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1098545 (https://phabricator.wikimedia.org/T380820) [15:28:54] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7010 is CRITICAL: connect to address 10.140.1.2 and port 9122: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:28:54] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp7010 is CRITICAL: connect to address 10.140.1.2 and port 3128: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:28:56] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp7010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [15:28:59] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp7010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [15:29:24] PROBLEM - SSH on cp7010 is CRITICAL: connect to address 10.140.1.2 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:30:03] !log ecarg@deploy2002 helmfile [codfw] START 
helmfile.d/services/wikifunctions: apply [15:30:12] 06SRE, 10LDAP-Access-Requests: Access to Data Hub - IAckerman-WMF - https://phabricator.wikimedia.org/T380091#10362054 (10elukey) 05Open→03Resolved a:03elukey ` elukey@mwmaint1002:~$ sudo ldapsearch -x cn=wmf | grep iacke member: uid=iackerman,ou=people,dc=wikimedia,dc=org ` Added! I don't think tha... [15:30:32] (03CR) 10Muehlenhoff: [C:03+2] Add umbrella Cumin alias for wikikube staging cluster [puppet] - 10https://gerrit.wikimedia.org/r/1092776 (owner: 10Muehlenhoff) [15:30:52] !log ecarg@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:31:04] !log ecarg@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:31:26] RECOVERY - SSH on cp7010 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:31:54] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp7010 is OK: HTTP OK: HTTP/1.0 200 OK - 36187 bytes in 0.407 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:31:54] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp7010 is OK: HTTP OK: HTTP/1.1 200 OK - 48376 bytes in 0.465 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [15:31:56] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp7010 is OK: SSL OK - OCSP staple validity for wikipedia.org has 549067 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-10-17 23:59:59 +0000 (expires in 324 days) https://wikitech.wikimedia.org/wiki/HTTPS [15:31:58] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp7010 is OK: SSL OK - OCSP staple validity for wikipedia.org has 549065 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2025-10-17 23:59:59 +0000 (expires in 324 days) 
https://wikitech.wikimedia.org/wiki/HTTPS [15:32:14] !log ecarg@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:32:55] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1098545 (https://phabricator.wikimedia.org/T380820) (owner: 10Elukey) [15:32:56] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1178.eqiad.wmnet with reason: Maintenance [15:33:10] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1178.eqiad.wmnet with reason: Maintenance [15:33:12] (03CR) 10CI reject: [V:04-1] admin: add sspalding to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1098545 (https://phabricator.wikimedia.org/T380820) (owner: 10Elukey) [15:33:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T370903)', diff saved to https://phabricator.wikimedia.org/P71215 and previous config saved to /var/cache/conftool/dbconfig/20241127-153316-ladsgroup.json [15:33:22] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [15:36:02] (03PS3) 10Elukey: services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098531 [15:37:00] (03PS3) 10Elukey: modules: add mesh.configuration 1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098511 (https://phabricator.wikimedia.org/T322647) [15:37:02] (03PS3) 10Elukey: modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647) [15:37:02] (03PS2) 10Elukey: charts: update tegola-vector-tiles to mesh.configuration:1.11.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098530 (https://phabricator.wikimedia.org/T322647) [15:37:03] (03PS4) 10Elukey: services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647) [15:39:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:41:32] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1026.eqiad.wmnet with OS bullseye [15:41:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10362156 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wdqs1026.eqiad.wmnet with OS bullseye executed with errors... [15:41:44] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1027.eqiad.wmnet with OS bullseye [15:41:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10362157 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wdqs1027.eqiad.wmnet with OS bullseye executed with errors... 
[15:42:51] (03CR) 10Muehlenhoff: [C:03+2] Make docker::baseimages ensurable [puppet] - 10https://gerrit.wikimedia.org/r/1094393 (https://phabricator.wikimedia.org/T379343) (owner: 10Muehlenhoff) [15:48:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T370903)', diff saved to https://phabricator.wikimedia.org/P71216 and previous config saved to /var/cache/conftool/dbconfig/20241127-154823-ladsgroup.json [15:48:28] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [15:49:56] (03PS1) 10Effie Mouzeli: kafka-main: Replace kafka-main1002 with kafka-main1007 [puppet] - 10https://gerrit.wikimedia.org/r/1098548 (https://phabricator.wikimedia.org/T363214) [15:51:07] (03PS2) 10Brouberol: airflow: add kerberos-related environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098094 (https://phabricator.wikimedia.org/T380765) [15:51:11] (03CR) 10C. Scott Ananian: Deploy Parsoid Read Views to de/ru wikivoyage and dagwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093405 (https://phabricator.wikimedia.org/T375394) (owner: 10C. 
Scott Ananian) [15:51:27] (03CR) 10Clément Goubert: [C:03+1] kafka-main: Replace kafka-main1002 with kafka-main1007 [puppet] - 10https://gerrit.wikimedia.org/r/1098548 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli) [15:51:34] (03PS3) 10Brouberol: airflow: add kerberos-related environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098094 (https://phabricator.wikimedia.org/T380765) [15:52:12] (03CR) 10Btullis: [C:03+1] airflow: add kerberos-related environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098094 (https://phabricator.wikimedia.org/T380765) (owner: 10Brouberol) [15:55:11] (03PS6) 10Majavah: dynamicproxy: Listen on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1091802 (https://phabricator.wikimedia.org/T379175) [15:55:45] (03CR) 10David Caro: Example of QoS rules for cloudcephosd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1058612 (https://phabricator.wikimedia.org/T371501) (owner: 10Cathal Mooney) [15:56:04] (03CR) 10Majavah: [C:03+2] dynamicproxy: Listen on IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1091802 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [15:58:06] (03CR) 10Kamila Součková: [C:03+1] "🎉" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [15:59:17] (03PS1) 10C. 
Scott Ananian: Allow defaulting to Parsoid Read Views when MobileFrontEnd is active [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098549 [15:59:19] (03CR) 10Majavah: mediawiki: add mercurius features (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080583 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [15:59:34] (03CR) 10Brouberol: [C:03+2] airflow: add kerberos-related environment variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098094 (https://phabricator.wikimedia.org/T380765) (owner: 10Brouberol) [15:59:43] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1098548 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli) [16:00:13] (03CR) 10C. Scott Ananian: [C:04-2] "On hold pending train deploy of the Depends-On patch and the December deployment freeze." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098549 (owner: 10C. Scott Ananian) [16:02:28] (03PS3) 10Majavah: dynamicproxy: Run Redis update in app context [puppet] - 10https://gerrit.wikimedia.org/r/1091848 (https://phabricator.wikimedia.org/T379175) [16:02:28] (03PS11) 10Majavah: dynamicproxy: Canocalize IP addresses before comparing [puppet] - 10https://gerrit.wikimedia.org/r/1088339 (https://phabricator.wikimedia.org/T379175) [16:02:28] (03PS10) 10Majavah: dynamicproxy: Provision AAAA records [puppet] - 10https://gerrit.wikimedia.org/r/1088338 (https://phabricator.wikimedia.org/T379175) [16:03:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P71217 and previous config saved to /var/cache/conftool/dbconfig/20241127-160330-ladsgroup.json [16:04:34] (03CR) 10Effie Mouzeli: [C:03+2] kafka-main: Replace kafka-main1002 with kafka-main1007 [puppet] - 10https://gerrit.wikimedia.org/r/1098548 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli) [16:04:36] (03CR) 10Majavah: [C:03+2] 
dynamicproxy: Run Redis update in app context [puppet] - 10https://gerrit.wikimedia.org/r/1091848 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [16:05:24] !log fabfur@cumin1002 START - Cookbook sre.dns.netbox [16:11:03] !log fabfur@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp70101 - fabfur@cumin1002" [16:11:07] !log fabfur@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cp70101 - fabfur@cumin1002" [16:11:08] !log fabfur@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:11:28] !log installing distro-info-data updates from bookworm point release [16:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:48] !log roll restarting kafka-main brokers - T363214 [16:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:52] T363214: kafka-main100[6789] and kafka-main1010 implementation tracking - https://phabricator.wikimedia.org/T363214 [16:13:20] (03PS1) 10Fabfur: hiera: fix magru ip addresses during migration [puppet] - 10https://gerrit.wikimedia.org/r/1098554 (https://phabricator.wikimedia.org/T380307) [16:15:21] (03CR) 10Ssingh: hiera: fix magru ip addresses during migration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1098554 (https://phabricator.wikimedia.org/T380307) (owner: 10Fabfur) [16:15:38] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:16:22] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10362308 (10MoritzMuehlenhoff) [16:16:25] !log jiji@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-main-eqiad [16:16:28] PROBLEM - mailman archives on lists1004 
is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:16:51] (03CR) 10Majavah: [C:03+2] dynamicproxy: Canocalize IP addresses before comparing [puppet] - 10https://gerrit.wikimedia.org/r/1088339 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [16:17:18] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52922 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:17:28] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:18:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P71218 and previous config saved to /var/cache/conftool/dbconfig/20241127-161837-ladsgroup.json [16:18:42] (03CR) 10Elukey: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1098545 (https://phabricator.wikimedia.org/T380820) (owner: 10Elukey) [16:19:03] (03CR) 10Elukey: [C:03+2] profile::service_proxy::envoy: add tegola [puppet] - 10https://gerrit.wikimedia.org/r/1097333 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [16:19:05] (03Abandoned) 10Muehlenhoff: Cover one more case in the setup of Envoy firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/1082806 (owner: 10Muehlenhoff) [16:19:22] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic, 13Patch-For-Review: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10362331 (10Fabfur) [16:22:32] (03PS2) 10C. 
Scott Ananian: Allow defaulting to Parsoid Read Views when MobileFrontEnd is active [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098549 (https://phabricator.wikimedia.org/T381002) [16:23:48] (03PS2) 10Fabfur: hiera: fix magru dns7001 ip address during migration [puppet] - 10https://gerrit.wikimedia.org/r/1098554 (https://phabricator.wikimedia.org/T380307) [16:24:00] (03CR) 10Ssingh: [C:03+1] hiera: fix magru dns7001 ip address during migration [puppet] - 10https://gerrit.wikimedia.org/r/1098554 (https://phabricator.wikimedia.org/T380307) (owner: 10Fabfur) [16:24:17] (03CR) 10Fabfur: hiera: fix magru dns7001 ip address during migration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1098554 (https://phabricator.wikimedia.org/T380307) (owner: 10Fabfur) [16:24:57] (03CR) 10Fabfur: [C:03+2] hiera: fix magru dns7001 ip address during migration [puppet] - 10https://gerrit.wikimedia.org/r/1098554 (https://phabricator.wikimedia.org/T380307) (owner: 10Fabfur) [16:25:15] (03PS1) 10Muehlenhoff: cloudweb: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1098556 [16:26:13] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp7010.magru.wmnet with OS bullseye [16:26:26] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 06Traffic, 13Patch-For-Review: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10362368 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp7010... 
[16:26:37] (03CR) 10Majavah: cloudweb: Restrict access to Envoy port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1098556 (owner: 10Muehlenhoff) [16:27:05] (03PS2) 10Muehlenhoff: cloudweb: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1098556 [16:27:45] (03CR) 10Dzahn: [C:03+1] mailman: run tasks every 24 hours [puppet] - 10https://gerrit.wikimedia.org/r/1098489 (https://phabricator.wikimedia.org/T377045) (owner: 10AOkoth) [16:27:59] !log jiji@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-eqiad [16:28:50] (03CR) 10Muehlenhoff: cloudweb: Restrict access to Envoy port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1098556 (owner: 10Muehlenhoff) [16:28:58] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1098556 (owner: 10Muehlenhoff) [16:33:44] (03CR) 10Majavah: cloudweb: Restrict access to Envoy port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1098556 (owner: 10Muehlenhoff) [16:33:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T370903)', diff saved to https://phabricator.wikimedia.org/P71220 and previous config saved to /var/cache/conftool/dbconfig/20241127-163344-ladsgroup.json [16:33:47] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1192.eqiad.wmnet with reason: Maintenance [16:33:51] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [16:34:00] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1192.eqiad.wmnet with reason: Maintenance [16:34:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T370903)', diff saved to https://phabricator.wikimedia.org/P71221 and previous config saved to /var/cache/conftool/dbconfig/20241127-163407-ladsgroup.json [16:35:07] 
07Puppet, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10362431 (10fnegri) [16:36:08] (03CR) 10JMeybohm: [C:03+2] k8s.reboot-nodes: Limit allowed aliases to those of the k8s cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/1098091 (owner: 10JMeybohm) [16:36:26] (03PS1) 10Muehlenhoff: Assign builder role to build2002 (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1098558 (https://phabricator.wikimedia.org/T379343) [16:39:02] (03PS4) 10Elukey: modules: add mesh.configuration 1.10.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098511 (https://phabricator.wikimedia.org/T322647) [16:39:02] (03PS4) 10Elukey: modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647) [16:39:03] (03PS3) 10Elukey: charts: update tegola-vector-tiles to mesh.configuration:1.10.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098530 (https://phabricator.wikimedia.org/T322647) [16:39:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [16:39:03] (03PS5) 10Elukey: services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647) [16:40:50] (03CR) 10Bking: [C:03+2] wdqs-ldf: Make Data Platform SRE the recipient of the LDF alerts [puppet] - 10https://gerrit.wikimedia.org/r/1097441 (https://phabricator.wikimedia.org/T379182) (owner: 10Bking) [16:40:54] (03CR) 10Dwisehaupt: [C:03+2] Records for analytics*.frdev for consistency and 
new service. [dns] - 10https://gerrit.wikimedia.org/r/1098508 (https://phabricator.wikimedia.org/T377363) (owner: 10Jgreen) [16:41:18] (03PS2) 10Muehlenhoff: Assign builder role to build2002 (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1098558 (https://phabricator.wikimedia.org/T379343) [16:41:50] (03PS1) 10Effie Mouzeli: Update various kafka-main connection strings for kafka-main1007 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098559 (https://phabricator.wikimedia.org/T363214) [16:41:52] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1098558 (https://phabricator.wikimedia.org/T379343) (owner: 10Muehlenhoff) [16:42:19] (03Merged) 10jenkins-bot: k8s.reboot-nodes: Limit allowed aliases to those of the k8s cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/1098091 (owner: 10JMeybohm) [16:43:07] (03PS5) 10Elukey: modules: add mesh.configuration 1.10.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098511 (https://phabricator.wikimedia.org/T322647) [16:43:07] (03PS5) 10Elukey: modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647) [16:43:07] (03PS4) 10Elukey: charts: update tegola-vector-tiles to mesh.configuration:1.10.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098530 (https://phabricator.wikimedia.org/T322647) [16:43:07] (03PS6) 10Elukey: services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647) [16:45:18] (03PS6) 10Elukey: modules: add mesh.configuration 1.10.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098511 (https://phabricator.wikimedia.org/T322647) [16:45:18] (03PS6) 10Elukey: modules: add health checks to the mesh's _tcp_cluster config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098512 (https://phabricator.wikimedia.org/T322647) 
[16:45:18] (PS5) Elukey: charts: update tegola-vector-tiles to mesh.configuration:1.10.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1098530 (https://phabricator.wikimedia.org/T322647) [16:45:18] (PS7) Elukey: services: add health checks to Tegola's postgres TCP proxy [deployment-charts] - https://gerrit.wikimedia.org/r/1098531 (https://phabricator.wikimedia.org/T322647) [16:47:14] (CR) JMeybohm: [C:+1] Update various kafka-main connection strings for kafka-main1007 [deployment-charts] - https://gerrit.wikimedia.org/r/1098559 (https://phabricator.wikimedia.org/T363214) (owner: Effie Mouzeli) [16:47:18] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7010.magru.wmnet with reason: host reimage [16:48:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T370903)', diff saved to https://phabricator.wikimedia.org/P71222 and previous config saved to /var/cache/conftool/dbconfig/20241127-164843-ladsgroup.json [16:48:48] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [16:49:43] (PS1) Máté Szabó: Allow IRS to record server-side interaction events [mediawiki-config] - https://gerrit.wikimedia.org/r/1098561 (https://phabricator.wikimedia.org/T380599) [16:51:05] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7010.magru.wmnet with reason: host reimage [16:51:11] SRE, SRE-tools, Data-Platform-SRE (2024.11.09 - 2024.11.29), Discovery-Search (Current work): Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507#10362546 (bking) Open→Resolved a:bking [16:52:05] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [16:52:33] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[16:52:34] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [16:52:43] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [16:52:45] !log jiji@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [16:53:19] !log jiji@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [16:53:21] !log jiji@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [16:53:35] !log jiji@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [16:53:37] !log jiji@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [16:54:15] !log jiji@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [16:54:16] !log jiji@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [16:54:33] !log jiji@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [16:54:35] !log jiji@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [16:54:48] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1098558 (https://phabricator.wikimedia.org/T379343) (owner: Muehlenhoff) [16:54:50] !log jiji@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [16:54:51] !log jiji@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:55:26] !log jiji@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:55:28] !log jiji@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [16:56:06] !log jiji@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[17:03:22] (Abandoned) Raymond Ndibe: profile::manifests::toolforge::bastion: harbor to /etc/toolforge/common.yaml [puppet] - https://gerrit.wikimedia.org/r/1090520 (https://phabricator.wikimedia.org/T358225) (owner: Raymond Ndibe) [17:03:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P71224 and previous config saved to /var/cache/conftool/dbconfig/20241127-170350-ladsgroup.json [17:06:01] (PS1) Majavah: P:toolforge: checker: Update Redis address [puppet] - https://gerrit.wikimedia.org/r/1098564 [17:06:01] (PS1) Majavah: P:toolforge: Update ToolsDB address [puppet] - https://gerrit.wikimedia.org/r/1098565 [17:07:34] (PS1) C. Scott Ananian: Revert "Normalize ref html before comparison" [extensions/Cite] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098567 [17:07:56] (PS2) C. Scott Ananian: Revert "Normalize ref html before comparison" [extensions/Cite] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098567 (https://phabricator.wikimedia.org/T380977) [17:08:47] (CR) David Caro: [C:+1] "LGTM, there's also some stuff in the secrets repo" [puppet] - https://gerrit.wikimedia.org/r/1098095 (https://phabricator.wikimedia.org/T380893) (owner: Andrew Bogott) [17:08:55] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/Cite] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098567 (https://phabricator.wikimedia.org/T380977) (owner: C. 
Scott Ananian) [17:09:06] (CR) Majavah: [C:+2] P:toolforge: checker: Update Redis address [puppet] - https://gerrit.wikimedia.org/r/1098564 (owner: Majavah) [17:09:12] (CR) Jdlrobson: [C:-1] Reenable non-UI experiment quick survey (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1091749 (https://phabricator.wikimedia.org/T379241) (owner: Bernard Wang) [17:09:12] (CR) Majavah: [C:+2] P:toolforge: Update ToolsDB address [puppet] - https://gerrit.wikimedia.org/r/1098565 (owner: Majavah) [17:14:28] (PS1) Majavah: hieradata: Drop eqiad.wmflabs from DNS search domains [puppet] - https://gerrit.wikimedia.org/r/1098571 (https://phabricator.wikimedia.org/T305834) [17:14:44] !log jiji@cumin1002 START - Cookbook sre.hosts.remove-downtime for kafka-main1007.eqiad.wmnet [17:14:45] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kafka-main1007.eqiad.wmnet [17:15:11] (CR) Effie Mouzeli: [C:+2] Update various kafka-main connection strings for kafka-main1007 [deployment-charts] - https://gerrit.wikimedia.org/r/1098559 (https://phabricator.wikimedia.org/T363214) (owner: Effie Mouzeli) [17:16:21] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply [17:16:24] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/benthos-cache-invalidator: apply [17:16:44] (Merged) jenkins-bot: Update various kafka-main connection strings for kafka-main1007 [deployment-charts] - https://gerrit.wikimedia.org/r/1098559 (https://phabricator.wikimedia.org/T363214) (owner: Effie Mouzeli) [17:17:03] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7010.magru.wmnet with OS bullseye [17:17:07] ops-magru, DC-Ops, Infrastructure-Foundations, Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10362709 (ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp7010.magru.wmnet with OS bulls... [17:17:35] (CR) Alexandros Kosiaris: [C:-1] "This is a heavy change to the chart, to the point I am wondering whether all these if clauses will pose a burden to us in the future. Anyw" [deployment-charts] - https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: Clément Goubert) [17:18:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P71225 and previous config saved to /var/cache/conftool/dbconfig/20241127-171857-ladsgroup.json [17:19:32] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply [17:20:15] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/benthos-cache-invalidator: apply [17:20:48] (CR) CI reject: [V:-1] Revert "Normalize ref html before comparison" [extensions/Cite] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098567 (https://phabricator.wikimedia.org/T380977) (owner: C. Scott Ananian) [17:22:03] (PS1) C. 
Scott Ananian: Turn on Parsoid Read views on jawikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/1098572 (https://phabricator.wikimedia.org/T380769) [17:22:36] (PS1) Fabfur: Revert "magru: set check_min_fe_mem false" [puppet] - https://gerrit.wikimedia.org/r/1098573 [17:22:42] (CR) Andrew Bogott: [C:+2] hieradata: Drop eqiad.wmflabs from DNS search domains [puppet] - https://gerrit.wikimedia.org/r/1098571 (https://phabricator.wikimedia.org/T305834) (owner: Majavah) [17:23:51] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [17:23:54] (CR) Ssingh: [C:+1] Revert "magru: set check_min_fe_mem false" [puppet] - https://gerrit.wikimedia.org/r/1098573 (owner: Fabfur) [17:24:34] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [17:24:36] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [17:25:23] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [17:27:16] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [17:27:55] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [17:27:56] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [17:28:27] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [17:31:20] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [17:31:23] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [17:31:54] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [17:31:55] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [17:32:06] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [17:32:08] !log jiji@deploy2002 helmfile [eqiad] START 
helmfile.d/services/eventgate-main: apply [17:32:21] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [17:32:28] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [17:32:29] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [17:32:37] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [17:34:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T370903)', diff saved to https://phabricator.wikimedia.org/P71226 and previous config saved to /var/cache/conftool/dbconfig/20241127-173403-ladsgroup.json [17:34:06] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1203.eqiad.wmnet with reason: Maintenance [17:34:10] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [17:34:20] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1203.eqiad.wmnet with reason: Maintenance [17:34:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T370903)', diff saved to https://phabricator.wikimedia.org/P71227 and previous config saved to /var/cache/conftool/dbconfig/20241127-173426-ladsgroup.json [17:35:36] (CR) Elukey: [C:+2] admin: add sspalding to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/1098545 (https://phabricator.wikimedia.org/T380820) (owner: Elukey) [17:39:04] (PS1) Chlod Alejandro: Increase Nuke max age to 90 days [mediawiki-config] - https://gerrit.wikimedia.org/r/1098574 (https://phabricator.wikimedia.org/T380846) [17:40:15] PROBLEM - Host lvs7003 is DOWN: PING CRITICAL - Packet loss = 100% [17:41:33] RECOVERY - Host lvs7003 is UP: PING OK - Packet loss = 0%, RTA = 115.12 ms [17:42:43] PROBLEM - PyBal backends health check on lvs7003 is CRITICAL: PYBAL CRITICAL - 
Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [17:44:16] PROBLEM - pybal on lvs7003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:44:43] RECOVERY - PyBal backends health check on lvs7003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:44:49] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:45:16] RECOVERY - pybal on lvs7003 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:45:40] (CR) Fabfur: [C:+2] Revert "magru: set check_min_fe_mem false" [puppet] - https://gerrit.wikimedia.org/r/1098573 (owner: Fabfur) [17:45:56] (CR) Clément Goubert: "I've answered a few of your questions, some are already sort of addressed in PS8. I'll add more comments tomorrow." 
[deployment-charts] - https://gerrit.wikimedia.org/r/1076746 (https://phabricator.wikimedia.org/T341555) (owner: Clément Goubert) [17:49:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T370903)', diff saved to https://phabricator.wikimedia.org/P71228 and previous config saved to /var/cache/conftool/dbconfig/20241127-174911-ladsgroup.json [17:49:17] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [17:52:03] ops-magru, DC-Ops, Infrastructure-Foundations, Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10362844 (Fabfur) lvs7003 has been restarted after cable swap, all fine [17:52:09] ops-magru, DC-Ops, Infrastructure-Foundations, Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10362843 (Fabfur) Reverted https://gerrit.wikimedia.org/r/c/operations/puppet/+/1098573 and ran puppet agent on `A:cp-magru`: NOOP as ex... 
[17:54:49] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T1800) [18:02:06] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:03:55] FIRING: MaxConntrack: Max conntrack at 91.99% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [18:04:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P71229 and previous config saved to /var/cache/conftool/dbconfig/20241127-180418-ladsgroup.json [18:04:25] PROBLEM - Check size of conntrack table on krb1001 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [18:05:25] RECOVERY - Check size of conntrack table on krb1001 is OK: OK: nf_conntrack is 40 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [18:05:40] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wdqs1027.eqiad.wmnet with OS bullseye [18:05:53] ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10362870 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wdqs1027.eqiad.wmnet with OS bullseye [18:06:55] ops-magru, DC-Ops, Infrastructure-Foundations, 
Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10362873 (Fabfur) BGP flag enabled on NetBox for lvs700[1-3] and dns700[12] [18:08:55] RESOLVED: MaxConntrack: Max conntrack at 93.44% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [18:09:26] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:19:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P71230 and previous config saved to /var/cache/conftool/dbconfig/20241127-181925-ladsgroup.json [18:20:10] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1098572 (https://phabricator.wikimedia.org/T380769) (owner: C. Scott Ananian) [18:30:08] (CR) Arlolra: [C:+1] Turn on Parsoid Read views on jawikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/1098572 (https://phabricator.wikimedia.org/T380769) (owner: C. Scott Ananian) [18:31:02] (CR) Arlolra: [C:+1] "recheck" [extensions/Cite] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098567 (https://phabricator.wikimedia.org/T380977) (owner: C. 
Scott Ananian) [18:34:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T370903)', diff saved to https://phabricator.wikimedia.org/P71231 and previous config saved to /var/cache/conftool/dbconfig/20241127-183432-ladsgroup.json [18:34:35] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1209.eqiad.wmnet with reason: Maintenance [18:34:38] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [18:34:49] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1209.eqiad.wmnet with reason: Maintenance [18:34:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1209 (T370903)', diff saved to https://phabricator.wikimedia.org/P71232 and previous config saved to /var/cache/conftool/dbconfig/20241127-183455-ladsgroup.json [18:36:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [18:36:58] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye [18:37:08] ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10363001 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs1025.eqiad.wmnet with OS bullseye [18:37:20] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on dns7001.wikimedia.org with reason: T380307 [18:37:22] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dns7001.wikimedia.org with reason: T380307 [18:37:24] T380307: installation tracking for hosts affected by magru re-shuffle - 
https://phabricator.wikimedia.org/T380307 [18:37:57] !log fabfur@cumin1002 START - Cookbook sre.hosts.remove-downtime for dns7001.wikimedia.org [18:37:57] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7001.wikimedia.org [18:38:02] !log fabfur@cumin1002 START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org [18:38:02] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org [18:38:20] !log fabfur@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs7001.magru.wmnet [18:38:20] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs7001.magru.wmnet [18:38:24] !log fabfur@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs7002.magru.wmnet [18:38:24] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs7002.magru.wmnet [18:38:28] !log fabfur@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs7003.magru.wmnet [18:38:29] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs7003.magru.wmnet [18:38:37] !log fabfur@cumin1002 START - Cookbook sre.hosts.remove-downtime for 16 hosts [18:38:44] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 16 hosts [18:40:06] ops-magru, DC-Ops, Infrastructure-Foundations, Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10363009 (Fabfur) Removed downtime from all lvs, dns and cp hosts in magru [18:46:43] (PS1) Arlolra: Bump wikimedia/parsoid to 0.21.0-a9 [vendor] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098581 (https://phabricator.wikimedia.org/T373035) [18:47:04] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: cluster=dnsbox,dc=magru [18:49:41] ops-magru, DC-Ops, Infrastructure-Foundations, Traffic: installation tracking for hosts affected by 
magru re-shuffle - https://phabricator.wikimedia.org/T380307#10363039 (Fabfur) Repooled dnsbox cluster and run authdns-update [18:49:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T370903)', diff saved to https://phabricator.wikimedia.org/P71233 and previous config saved to /var/cache/conftool/dbconfig/20241127-184946-ladsgroup.json [18:50:01] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [18:50:39] (CR) C. Scott Ananian: [C:+1] Bump wikimedia/parsoid to 0.21.0-a9 [vendor] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098581 (https://phabricator.wikimedia.org/T373035) (owner: Arlolra) [18:52:05] (PS1) Arlolra: Bump wikimedia/parsoid to 0.21.0-a9 [core] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098583 (https://phabricator.wikimedia.org/T380664) [18:53:09] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098583 (https://phabricator.wikimedia.org/T380664) (owner: Arlolra) [18:54:53] SRE, Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#10363056 (cmooney) So I've been meaning to look at this for ages and while how to generate the records were clear to me, how to update the existing [[ https://gerrit.wikimedia.org/... 
[18:56:23] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1025.eqiad.wmnet with OS bullseye [18:56:42] ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10363080 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs1025.eqiad.wmnet with OS bullseye executed with errors:... [19:00:57] SRE, Ceph, Infrastructure-Foundations, netops, Patch-For-Review: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501#10363115 (dcaro) A quick search did not find any reference for the mon option on the upstream ceph, but found a commit on a clone: http://w... [19:02:20] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye [19:02:33] ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10363116 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs1025.eqiad.wmnet with OS bullseye [19:04:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P71235 and previous config saved to /var/cache/conftool/dbconfig/20241127-190453-ladsgroup.json [19:05:37] jouncebot: nowandnext [19:05:37] No deployments scheduled for the next 1 hour(s) and 54 minute(s) [19:05:38] In 1 hour(s) and 54 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T2100) [19:06:07] I am going to deploy a change to puppet code that installs scap. Disabling puppet on "R:scap:target" for a few minutes. 
[19:06:32] but it's expected to be all noop on any existing scap::target [19:06:55] it's about fixing an issue on new hosts that get scap installed the first time [19:09:11] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: dc=magru,service=cdn [19:09:28] ops-magru, DC-Ops, Infrastructure-Foundations, Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10363181 (Fabfur) ran puppet-agent on `A:magru` [19:10:04] (CR) Dzahn: [C:+2] scap target: ensure scap is installed on host before it is required [puppet] - https://gerrit.wikimedia.org/r/1092841 (https://phabricator.wikimedia.org/T378769) (owner: Jaime Nuche) [19:12:15] ops-magru, DC-Ops, Infrastructure-Foundations, Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10363189 (Fabfur) Repooled all depooled cp hosts before repooling whole DC [19:13:04] !log disabled puppet on R:scap::target (180 hosts) for a short time - deploying gerrit:1092841 [19:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:19] PROBLEM - Disk space on thanos-be2004 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdf1 180187 MB (4% inode=92%): /srv/swift-storage/sdg1 205941 MB (5% inode=92%): /srv/swift-storage/sdc1 151439 MB (3% inode=90%): /srv/swift-storage/sdh1 167101 MB (4% inode=91%): /srv/swift-storage/sde1 178339 MB (4% inode=92%): /srv/swift-storage/sdd1 154510 MB (4% inode=91%): /srv/swift-storage/sdj1 175371 MB (4% inode=92%): /srv/swift-st [19:14:19] k1 169838 MB (4% inode=92%): /srv/swift-storage/sdi1 169300 MB (4% inode=92%): /srv/swift-storage/sdl1 199607 MB (5% inode=92%): /srv/swift-storage/sdn1 186908 MB (4% inode=92%): /srv/swift-storage/sdm1 188628 MB (4% inode=92%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2004&var-datasource=codfw+prometheus/ops [19:14:39] (PS1) Bartosz Dziewoński: Temporarily restore renamed messages [extensions/DiscussionTools] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098590 (https://phabricator.wikimedia.org/T372175) [19:14:49] PROBLEM - BGP status on pfw1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:14:50] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/DiscussionTools] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098590 (https://phabricator.wikimedia.org/T372175) (owner: Bartosz Dziewoński) [19:14:55] !log mforns@deploy2002 Started deploy [airflow-dags/analytics@99032bf]: regular weekly train [19:15:34] interesting, jouncebot claimed nothing is going to be deployed but also it's a regular weekly train [19:16:27] mutante: analytics train, not mw train :) [19:16:49] RECOVERY - BGP status on pfw1-codfw is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:17:01] rzl: well, what I am really wondering is just if that uses scap [19:17:51] !log mforns@deploy2002 Finished deploy [airflow-dags/analytics@99032bf]: regular weekly train (duration: 03m 10s) [19:18:03] !log brett@cumin2002 START - Cookbook sre.dns.admin DNS admin: pool site magru [reason: repool magru, T376737] [19:18:10] !log brett@cumin2002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site magru [reason: repool magru, T376737] [19:18:14] it's ok, if anything this is about seeing puppet errors, not scap deployment errors [19:18:15] mutante: for airflow it looks like yes [19:18:30] just reading https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Instances#analytics 
[19:18:34] but technically if I ask jouncebot I expected that to mean all [19:18:37] thanks rzl [19:18:52] rzl, jhathaway: magru has been repooled now [19:19:08] brett: ack, thanks! [19:20:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P71236 and previous config saved to /var/cache/conftool/dbconfig/20241127-192000-ladsgroup.json [19:21:47] Puppet, cloud-services-team, Cloud-VPS, Infrastructure-Foundations: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927#10363222 (Andrew) This has not recurred. Nevertheless we should figure out what's happening with the ruby functions that don't rai... [19:21:56] ops-magru, DC-Ops, Infrastructure-Foundations, Traffic: installation tracking for hosts affected by magru re-shuffle - https://phabricator.wikimedia.org/T380307#10363224 (Fabfur) Repooled magru DC [19:22:11] SRE-tools, Infrastructure-Foundations, Spicerack: Tab completion for cookbook names - https://phabricator.wikimedia.org/T367230#10363226 (Volans) @JMeybohm that practically covers the current production use case, but is not future proof as it doesn't cover all the generic cases. Hence why I said I wa... [19:22:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:22:56] I've got a brand-new R450 that gives the message "Unified Server Configurator does not support console redirection" when I try to connect to its console, has anyone seen that before? 
[19:23:10] oops, meant to post that in dc ops [19:23:28] sounds like it did not get the BIOS settings to enable console [19:23:53] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1025.eqiad.wmnet with OS bullseye [19:24:03] ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10363233 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs1025.eqiad.wmnet with OS bullseye executed with errors:... [19:24:30] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye [19:24:34] I have re-enabled puppet on all the scap::target hosts. So far I see no issues EXCEPT on 2 phab hosts. [19:24:41] ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10363234 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs1025.eqiad.wmnet with OS bullseye [19:24:45] but I am watching puppetboard for any others. [19:25:11] if there is anything then it's a puppet dependency cycle [19:25:49] PROBLEM - BGP status on pfw1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:25:54] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1027.eqiad.wmnet with OS bullseye [19:26:03] ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10363248 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wdqs1027.eqiad.wmnet with OS bullseye executed with errors... 
[19:27:49] RECOVERY - BGP status on pfw1-codfw is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:31:41] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2034.codfw.wmnet with reason: Maintenance [19:31:55] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2034.codfw.wmnet with reason: Maintenance [19:32:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling es2034 (T376905)', diff saved to https://phabricator.wikimedia.org/P71237 and previous config saved to /var/cache/conftool/dbconfig/20241127-193202-ladsgroup.json [19:32:36] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host wdqs1025.eqiad.wmnet with OS bullseye [19:34:29] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye [19:35:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T370903)', diff saved to https://phabricator.wikimedia.org/P71238 and previous config saved to /var/cache/conftool/dbconfig/20241127-193507-ladsgroup.json [19:35:09] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1211.eqiad.wmnet with reason: Maintenance [19:35:12] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [19:35:22] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1211.eqiad.wmnet with reason: Maintenance [19:35:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1211 (T370903)', diff saved to https://phabricator.wikimedia.org/P71239 and previous config saved to /var/cache/conftool/dbconfig/20241127-193529-ladsgroup.json [19:36:24] !log imported jenkins 2.479.2 to thirdparty/ci for bullseye-wikimedia [19:36:27] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2034 (T376905)', diff saved to https://phabricator.wikimedia.org/P71240 and previous config saved to /var/cache/conftool/dbconfig/20241127-193858-ladsgroup.json [19:39:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:40:18] ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10363420 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs1025.eqiad.wmnet with OS bullseye executed with errors:... [19:42:29] ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10363456 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs1025.eqiad.wmnet with OS bullseye [19:50:00] !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host wdqs1025.eqiad.wmnet with OS bullseye [19:50:09] ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10363552 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs1025.eqiad.wmnet with OS bullseye executed with errors:... 
[19:50:30] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1026.eqiad.wmnet with OS bullseye [19:50:46] ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10363555 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs1026.eqiad.wmnet with OS bullseye [19:51:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T370903)', diff saved to https://phabricator.wikimedia.org/P71241 and previous config saved to /var/cache/conftool/dbconfig/20241127-195129-ladsgroup.json [19:51:34] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [19:54:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2034', diff saved to https://phabricator.wikimedia.org/P71242 and previous config saved to /var/cache/conftool/dbconfig/20241127-195406-ladsgroup.json [20:06:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P71243 and previous config saved to /var/cache/conftool/dbconfig/20241127-200636-ladsgroup.json [20:09:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2034', diff saved to https://phabricator.wikimedia.org/P71244 and previous config saved to /var/cache/conftool/dbconfig/20241127-200913-ladsgroup.json [20:18:05] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1026.eqiad.wmnet with reason: host reimage [20:20:54] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1026.eqiad.wmnet with reason: host reimage [20:21:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P71245 and previous config 
saved to /var/cache/conftool/dbconfig/20241127-202143-ladsgroup.json [20:24:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2034 (T376905)', diff saved to https://phabricator.wikimedia.org/P71246 and previous config saved to /var/cache/conftool/dbconfig/20241127-202420-ladsgroup.json [20:24:26] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2029.codfw.wmnet with reason: Maintenance [20:24:40] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2029.codfw.wmnet with reason: Maintenance [20:24:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling es2029 (T376905)', diff saved to https://phabricator.wikimedia.org/P71247 and previous config saved to /var/cache/conftool/dbconfig/20241127-202446-ladsgroup.json [20:31:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2029 (T376905)', diff saved to https://phabricator.wikimedia.org/P71248 and previous config saved to /var/cache/conftool/dbconfig/20241127-203143-ladsgroup.json [20:35:12] (PS1) Andrew Bogott: Openstack nova: make a few more read-only endpoints public [puppet] - https://gerrit.wikimedia.org/r/1098613 (https://phabricator.wikimedia.org/T380069) [20:36:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T370903)', diff saved to https://phabricator.wikimedia.org/P71249 and previous config saved to /var/cache/conftool/dbconfig/20241127-203650-ladsgroup.json [20:36:52] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1214.eqiad.wmnet with reason: Maintenance [20:36:55] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [20:37:18] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1214.eqiad.wmnet with reason: Maintenance [20:37:23] (PS2) 
Andrew Bogott: Openstack nova: make a few more read-only endpoints public [puppet] - https://gerrit.wikimedia.org/r/1098613 (https://phabricator.wikimedia.org/T380069) [20:37:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T370903)', diff saved to https://phabricator.wikimedia.org/P71250 and previous config saved to /var/cache/conftool/dbconfig/20241127-203724-ladsgroup.json [20:38:18] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002" [20:39:06] (CR) Andrew Bogott: [C:+2] Openstack nova: make a few more read-only endpoints public [puppet] - https://gerrit.wikimedia.org/r/1098613 (https://phabricator.wikimedia.org/T380069) (owner: Andrew Bogott) [20:43:46] (CR) Jforrester: Add CodeMirror to BetaFeaturesAllowList (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1098161 (https://phabricator.wikimedia.org/T376735) (owner: MusikAnimal) [20:44:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2237 depool (T379813)', diff saved to https://phabricator.wikimedia.org/P71251 and previous config saved to /var/cache/conftool/dbconfig/20241127-204450-ladsgroup.json [20:44:56] T379813: Wikimedia\Rdbms\DBQueryError: Error 1034: Index for table 'wbc_entity_usage' is corrupt; try to repair itFunction: Wikibase\Client\Usage\Sql\EntityUsageTable::queryUsagesQuery: SELECT eu_aspect,eu_entity_id FROM `wbc_entity - https://phabricator.wikimedia.org/T379813 [20:45:18] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Optimize (T379813) [20:45:31] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Optimize (T379813) [20:46:10] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:46:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2029', diff saved to https://phabricator.wikimedia.org/P71252 and previous config saved to /var/cache/conftool/dbconfig/20241127-204650-ladsgroup.json [20:52:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T370903)', diff saved to https://phabricator.wikimedia.org/P71253 and previous config saved to /var/cache/conftool/dbconfig/20241127-205238-ladsgroup.json [20:52:41] (PS1) DDesouza: Reader Survey: Undeploy on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1098617 (https://phabricator.wikimedia.org/T378660) [20:52:43] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [20:55:36] (PS1) Gergő Tisza: Use `useformat` query param for device detection or mobile domain (m.) [extensions/CentralAuth] (wmf/1.44.0-wmf.4) - https://gerrit.wikimedia.org/r/1098622 (https://phabricator.wikimedia.org/T380646) [20:56:05] (PS1) Gergő Tisza: Use `useformat` query param for device detection or mobile domain (m.) 
[extensions/CentralAuth] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098623 (https://phabricator.wikimedia.org/T380646) [20:56:43] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/CentralAuth] (wmf/1.44.0-wmf.4) - https://gerrit.wikimedia.org/r/1098622 (https://phabricator.wikimedia.org/T380646) (owner: Gergő Tisza) [20:56:57] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 27 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/CentralAuth] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098623 (https://phabricator.wikimedia.org/T380646) (owner: Gergő Tisza) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T2100). [21:00:05] arlolra, MatmaRex, and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:28] o/ [21:00:57] hi [21:01:04] o/ [21:01:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2029', diff saved to https://phabricator.wikimedia.org/P71254 and previous config saved to /var/cache/conftool/dbconfig/20241127-210157-ladsgroup.json [21:04:55] FIRING: MaxConntrack: Max conntrack at 93.06% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [21:05:04] is a deployer needed? [21:05:54] or are folks in the queue able to self-deploy? 
[21:06:59] cjming: I can self-deploy (also deploy the rest if there's no one else but happy to leave that to you if you are willing) [21:07:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P71255 and previous config saved to /var/cache/conftool/dbconfig/20241127-210745-ladsgroup.json [21:07:55] (PS1) Dzahn: Revert "scap target: ensure scap is installed on host before it is required" [puppet] - https://gerrit.wikimedia.org/r/1098625 [21:08:26] arlolra: do you need a deployer? [21:08:33] MatmaRex: same Q to you? [21:08:44] yes please [21:08:49] cjming: I don't have much experience deploying mediawiki but I'm in the deployment group [21:09:22] ok - how about this -- i'll do the ones for arlolra and MatmaRex and then pass to you tgr? [21:09:46] arlolra: can your backports go out together? [21:09:55] RESOLVED: MaxConntrack: Max conntrack at 93.06% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [21:10:10] (CR) CI reject: [V:-1] Revert "scap target: ensure scap is installed on host before it is required" [puppet] - https://gerrit.wikimedia.org/r/1098625 (owner: Dzahn) [21:10:16] arlolra: is order important? can i merge your backports and do the config patch first? [21:10:37] The order is important, the config patch should go last [21:10:54] I can try to do mine if everyone has patience [21:11:29] And is around to help pick up the pieces [21:12:01] arlolra: can your backports go out together? 
[21:12:47] The 1.44.0-wmf.5 patches can go out together, yes [21:13:11] But it might be better to do the revert [21:13:13] Then the bumps [21:13:16] Then the config [21:13:29] got it - ok - i'll do yours first then [21:14:03] Thanks [21:14:12] (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [extensions/Cite] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098567 (https://phabricator.wikimedia.org/T380977) (owner: C. Scott Ananian) [21:14:59] arlolra: bec backports take so long to merge - i'm going to go ahead and merge your bump patches now too [21:15:19] Sounds good [21:15:23] (CR) Clare Ming: [C:+2] Bump wikimedia/parsoid to 0.21.0-a9 [vendor] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098581 (https://phabricator.wikimedia.org/T373035) (owner: Arlolra) [21:15:31] (CR) Clare Ming: [C:+2] Bump wikimedia/parsoid to 0.21.0-a9 [core] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098583 (https://phabricator.wikimedia.org/T380664) (owner: Arlolra) [21:15:54] (PS2) Dzahn: Revert "scap target: ensure scap is installed on host before it is required" [puppet] - https://gerrit.wikimedia.org/r/1098625 [21:16:18] (I'm also here if needed!) 
[21:16:41] (PS1) DDesouza: Reader Survey: Deploy on multiple wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1098627 (https://phabricator.wikimedia.org/T378660) [21:16:44] * cjming thanks cscott [21:17:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance es2029 (T376905)', diff saved to https://phabricator.wikimedia.org/P71256 and previous config saved to /var/cache/conftool/dbconfig/20241127-211704-ladsgroup.json [21:18:10] (CR) CI reject: [V:-1] Revert "scap target: ensure scap is installed on host before it is required" [puppet] - https://gerrit.wikimedia.org/r/1098625 (owner: Dzahn) [21:18:25] tgr: i'll pass to you when i finish with arlolra's and MatmaRex's patches -- it will probably be at least 30 minutes from now [21:18:42] thx [21:21:02] MatmaRex: i'll manually merge your patch in about 5-10 minutes so it'll be ready for scap backport after the first bunch finish [21:21:20] sure, thanks [21:21:23] (PS3) Dzahn: Revert "scap target: ensure scap is installed on host before it is required" [puppet] - https://gerrit.wikimedia.org/r/1098625 [21:22:29] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1098617 (https://phabricator.wikimedia.org/T378660) (owner: DDesouza) [21:22:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P71257 and previous config saved to /var/cache/conftool/dbconfig/20241127-212252-ladsgroup.json [21:24:48] ugh [21:24:49] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1098627 (https://phabricator.wikimedia.org/T378660) (owner: DDesouza) 
[21:24:52] why do we have a login.m.wikimedia.beta.wmflabs.org? [21:25:31] I think it needs some more domain components [21:25:54] it shouldn't exist at all [21:26:06] login.m.wikimedia.org is a DNS lookup error [21:26:34] as in, there is intentionally no mobile variant [21:26:49] but apparently all the special-casing around that breaks on beta [21:27:45] can't have wildcard SSL certs for more than one level, afair [21:28:36] that URL structure is fine in general [21:28:49] the mobile version of en.wikipedia.org is en.m.wikipedia.org etc [21:29:11] en.wikipedia.beta.wmflabs.org / en.m.wikipedia.beta.wmflabs.org on beta [21:29:20] (CR) Clare Ming: [C:+2] Temporarily restore renamed messages [extensions/DiscussionTools] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098590 (https://phabricator.wikimedia.org/T372175) (owner: Bartosz Dziewoński) [21:29:48] but we don't want a separate mobile domain for loginwiki since the entire point is having a single domain for central session cookies so we don't want to split that by device [21:30:15] but somehow on beta both MediaWiki and the DNS / routing infra think there is a separate mobile login domain [21:30:37] (Merged) jenkins-bot: Revert "Normalize ref html before comparison" [extensions/Cite] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098567 (https://phabricator.wikimedia.org/T380977) (owner: C. Scott Ananian) [21:31:00] (Merged) jenkins-bot: Bump wikimedia/parsoid to 0.21.0-a9 [vendor] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098581 (https://phabricator.wikimedia.org/T373035) (owner: Arlolra) [21:31:05] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1098567|Revert "Normalize ref html before comparison" (T380977)]] [21:31:06] arlolra: do you need to verify the revert or am i good to sync? 
[21:31:10] T380977: Wikimedia\RemexHtml\TreeBuilder\TreeBuilderError: Setting foreign attributes is not supported - https://phabricator.wikimedia.org/T380977 [21:31:18] I can verify it [21:31:28] 1 sec then [21:32:02] (PS1) Andrew Bogott: openstack nova policy: open a few more read-only endpoints [puppet] - https://gerrit.wikimedia.org/r/1098628 (https://phabricator.wikimedia.org/T380069) [21:32:37] Arlo there's a url in phab and the slack thread [21:33:10] Yup [21:33:23] https://he.wikipedia.org/w/index.php?title=%D7%95%D7%95%D7%90%D7%98%D7%A1%D7%90%D7%A4&uselang=en&useparsoid=1 [21:33:27] (CR) Andrew Bogott: [C:+2] openstack nova policy: open a few more read-only endpoints [puppet] - https://gerrit.wikimedia.org/r/1098628 (https://phabricator.wikimedia.org/T380069) (owner: Andrew Bogott) [21:34:21] (but I don't have x-wikimedia-debug on my phone browser so I can't check whether it's fixed) [21:35:14] Surely you can send a header from your phone [21:35:47] (Merged) jenkins-bot: Bump wikimedia/parsoid to 0.21.0-a9 [core] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098583 (https://phabricator.wikimedia.org/T380664) (owner: Arlolra) [21:35:48] arlolra: please check revert - up on test servers [21:35:56] Just give me a telnet client to port 80 [21:36:02] cscott: you can set a cookie on mobile [21:36:27] cjming: Thanks, it's working as expected. 
Please continue [21:36:43] just visit Special:WikimediaDebug [21:36:53] cool - syncing [21:37:06] !log cjming@deploy2002 cjming, cscott: Backport for [[gerrit:1098567|Revert "Normalize ref html before comparison" (T380977)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:37:08] !log cjming@deploy2002 cjming, cscott: Continuing with sync [21:37:08] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002" [21:37:08] You're going to make me into one of those millennials who does all their hacking from their phone [21:37:09] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1026.eqiad.wmnet with OS bullseye [21:37:11] T380977: Wikimedia\RemexHtml\TreeBuilder\TreeBuilderError: Setting foreign attributes is not supported - https://phabricator.wikimedia.org/T380977 [21:37:25] ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10363891 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs1026.eqiad.wmnet with OS bullseye completed: - wdqs1026... 
[21:38:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T370903)', diff saved to https://phabricator.wikimedia.org/P71258 and previous config saved to /var/cache/conftool/dbconfig/20241127-213759-ladsgroup.json [21:38:02] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1216.eqiad.wmnet with reason: Maintenance [21:38:05] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [21:38:16] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1216.eqiad.wmnet with reason: Maintenance [21:39:44] (Merged) jenkins-bot: Temporarily restore renamed messages [extensions/DiscussionTools] (wmf/1.44.0-wmf.5) - https://gerrit.wikimedia.org/r/1098590 (https://phabricator.wikimedia.org/T372175) (owner: Bartosz Dziewoński) [21:40:22] tgr|away: that's a good trick that I didn't know before [21:40:35] !log bking@cumin1002 START - Cookbook sre.hosts.reimage for host wdqs1027.eqiad.wmnet with OS bullseye [21:40:41] I too can now confirm that the canaries look good ;) a little slow [21:40:49] ops-eqiad, SRE, DC-Ops, Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10363905 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1002 for host wdqs1027.eqiad.wmnet with OS bullseye [21:40:55] so slow [21:40:58] (CR) Dzahn: [C:+2] Revert "scap target: ensure scap is installed on host before it is required" [puppet] - https://gerrit.wikimedia.org/r/1098625 (owner: Dzahn) [21:43:54] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098567|Revert "Normalize ref html before comparison" (T380977)]] (duration: 12m 49s) [21:43:59] T380977: Wikimedia\RemexHtml\TreeBuilder\TreeBuilderError: Setting foreign attributes is not supported - 
https://phabricator.wikimedia.org/T380977 [21:45:06] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1098581|Bump wikimedia/parsoid to 0.21.0-a9 (T373035 T380664)]], [[gerrit:1098583|Bump wikimedia/parsoid to 0.21.0-a9 (T380664)]] [21:45:11] T373035: TypeError: Argument 1 passed to Wikimedia\Parsoid\Config\Env::makeTitleFromURLDecodedStr() must be of the type string, int given, called in /vendor/wikimedia/parsoid/src/Wt2Html/DOM/Processors/AddRedLinks.php:90 - https://phabricator.wikimedia.org/T373035 [21:45:11] T380664: CTT tasks week of 2024-11-22 - https://phabricator.wikimedia.org/T380664 [21:47:17] apparently production MediaWiki also thinks login.m.wikimedia.org exists [21:47:24] how does this not break everything? [21:47:54] Now that you've observed it, it will undoubtedly start breaking everything. [21:48:14] cjming: I'll retract the backports, need to fix mobile domain configuration first [21:48:25] tgr: sounds good [21:52:40] oof - not sure what's happening - presumably scap is still doing its thing - could it be stuck? 
[21:53:03] also see https://phabricator.wikimedia.org/T152882 for "misc wikis lack mobile domains" [21:53:25] (PS1) Gergő Tisza: Fix mobile domain logic for login.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1098633 [21:53:37] :/ [21:53:46] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1226.eqiad.wmnet with reason: Maintenance [21:53:54] (unless someone is willing to +1 ^^ and then I can deploy it in the window) [21:54:00] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1226.eqiad.wmnet with reason: Maintenance [21:54:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T370903)', diff saved to https://phabricator.wikimedia.org/P71259 and previous config saved to /var/cache/conftool/dbconfig/20241127-215407-ladsgroup.json [21:54:12] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [21:55:10] any SREs around to confirm all is ok? it seems to be stuck at: https://www.irccloud.com/pastebin/5ZI3MnEf/ [21:55:25] cjming: 10 minutes doesn't seem that extreme [21:55:32] really? [21:56:00] i've been trying to cultivate more patience [21:56:08] well if it's stuck at building the images then maybe yes [21:56:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.294s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:56:21] I have seen the sync part take a while [21:56:56] in my experience it's CI that takes the most time -- scap has been pretty zippy lately imho [21:57:16] maybe that logfile contains something useful? 
[21:59:15] from what i remember hearing, vendor syncs take a while whereas config / single-file syncs are zippy [21:59:28] finally started going again -- hopefully all good [21:59:52] (CR) Bartosz Dziewoński: [C:+1] Fix mobile domain logic for login.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1098633 (owner: Gergő Tisza) [22:00:04] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241127T2200) [22:00:05] received a VO magru page but I think it's just a 24-hour repage from yesterday [22:00:11] (cc jhathaway) [22:00:26] yeah it is, marking it resolved [22:01:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.294s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:01:19] subbu: gtk - thanks [22:02:06] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:02:30] thanks @rzl [22:02:49] damn that muscle memory :( [22:04:30] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 28 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1098633 (owner: Gergő Tisza) [22:04:31] jhathaway: paging you at 3:59 when the long weekend starts at 4:00 is admittedly pretty funny :) see you, enjoy [22:04:55] :), thanks! 
[22:06:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T370903)', diff saved to https://phabricator.wikimedia.org/P71260 and previous config saved to /var/cache/conftool/dbconfig/20241127-220638-ladsgroup.json [22:06:44] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [22:06:59] (PS2) Cathal Mooney: Example of QoS rules for cloudcephosd [puppet] - https://gerrit.wikimedia.org/r/1058612 (https://phabricator.wikimedia.org/T371501) [22:07:38] (CR) CI reject: [V:-1] Example of QoS rules for cloudcephosd [puppet] - https://gerrit.wikimedia.org/r/1058612 (https://phabricator.wikimedia.org/T371501) (owner: Cathal Mooney) [22:07:46] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:07:59] !log bking@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1027.eqiad.wmnet with reason: host reimage [22:08:46] (PS3) Cathal Mooney: Example of QoS rules for cloudcephosd [puppet] - https://gerrit.wikimedia.org/r/1058612 (https://phabricator.wikimedia.org/T371501) [22:08:48] arlolra: bump patches should be up on test servers [22:09:06] !log cjming@deploy2002 arlolra, cjming: Backport for [[gerrit:1098581|Bump wikimedia/parsoid to 0.21.0-a9 (T373035 T380664)]], [[gerrit:1098583|Bump wikimedia/parsoid to 0.21.0-a9 (T380664)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:09:11] T373035: TypeError: Argument 1 passed to Wikimedia\Parsoid\Config\Env::makeTitleFromURLDecodedStr() must be of the type string, int given, called in /vendor/wikimedia/parsoid/src/Wt2Html/DOM/Processors/AddRedLinks.php:90 - https://phabricator.wikimedia.org/T373035 [22:09:12] T380664: CTT tasks week of 2024-11-22 - https://phabricator.wikimedia.org/T380664 [22:09:26] FIRING: SystemdUnitFailed: 
ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:09:29] cjming: Testing [22:09:38] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:09:41] (i'm still around, if we reach my backport patch today) [22:11:03] !log bking@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1027.eqiad.wmnet with reason: host reimage [22:11:25] MatmaRex: your patch might be too i think [22:11:47] oh. lemme test [22:12:07] I wonder if it made sense to adopt deploying code in batches? it's basically impossible to fit 5-6 patches in a backport window since we switched to k8s / full scap [22:12:21] ^^^ agree [22:12:22] cjming: yep, my thing looks fixed on mwdebug [22:13:25] cool - i think i still need to scap backport yours separately? anyway, i can do that after current batch is done [22:13:40] cjming: please continue [22:13:44] phew! [22:13:46] !log cjming@deploy2002 arlolra, cjming: Continuing with sync [22:13:58] if it's on the testservers, it will be in production as well [22:14:25] if scap complained about an unexpected patch and showed you the diff, it's going to be deployed [22:14:47] got it - so no need to scap backport manually merged patches? 
[22:16:07] scap backport just merges the patch and then does a git pull and a full scap (and rebase security patches and other fine details like that) [22:16:28] so a merge and then it syncs out the git heads to the servers, basically [22:20:17] when there are several backports in a queue, i end up manually merging stuff to get ahead of CI but i get concerned if anything ever needs reverting [22:21:11] MatmaRex: then yours should be live along with arlolra's backports - hopefully soonish [22:21:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P71261 and previous config saved to /var/cache/conftool/dbconfig/20241127-222145-ladsgroup.json [22:22:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:22:28] (03CR) 10Gergő Tisza: "@nshahquinn@wikimedia.org FYI (since the comment says to notify you)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098633 (owner: 10Gergő Tisza) [22:26:20] !log bking@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin1002" [22:27:44] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098581|Bump wikimedia/parsoid to 0.21.0-a9 (T373035 T380664)]], [[gerrit:1098583|Bump wikimedia/parsoid to 0.21.0-a9 (T380664)]] (duration: 42m 38s) [22:27:49] T373035: TypeError: Argument 1 passed to Wikimedia\Parsoid\Config\Env::makeTitleFromURLDecodedStr() must be of the type string, int given, called in /vendor/wikimedia/parsoid/src/Wt2Html/DOM/Processors/AddRedLinks.php:90 - https://phabricator.wikimedia.org/T373035 [22:27:50] T380664: CTT tasks week of 2024-11-22 - https://phabricator.wikimedia.org/T380664 [22:27:51] 
(03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098572 (https://phabricator.wikimedia.org/T380769) (owner: 10C. Scott Ananian) [22:28:10] arlolra: revert + bumps should be live - doing your config patch no [22:28:14] *now [22:28:29] MatmaRex: yours should be live too [22:28:31] Yay [22:28:36] (03Merged) 10jenkins-bot: Turn on Parsoid Read views on jawikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098572 (https://phabricator.wikimedia.org/T380769) (owner: 10C. Scott Ananian) [22:28:38] yep, i see it. thanks cjming [22:28:45] cjming: thanks [22:29:03] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1098572|Turn on Parsoid Read views on jawikivoyage (T380769)]] [22:29:08] T380769: Deploy Parsoid Read Views to ja wikivoyage (week of 2024-11-27) - https://phabricator.wikimedia.org/T380769 [22:29:21] If this all goes off smoothly, we should have a party in two weeks to celebrate. Maybe wmf will fly us all to Barcelona for it. [22:29:37] lol [22:30:28] arlolra has to step out .. so I am around to verify the last bit ... parsoid read view on jawikivoyage. [22:30:46] cool - should be ready shortly [22:30:49] I can also verify from my phone now, thanks to tgr [22:30:52] but looks like cscott also has power to be a cool mobile-testing kid. [22:31:06] and he can now read my mind too it seems with his new found powers. [22:31:24] I'm mostly here as a distraction apparently [22:31:26] Sorry, yes, I have to make it to the pharmacy before it closes but subbu and cscott are here [22:31:31] np! [22:32:07] cjming: thanks for the help and my resolution will be doing my own deploys in the new year [22:32:48] arlolra: yw! 
good resolution :) [22:34:11] I'll queue up ja.wikivoyage on my phone while I'm waiting [22:34:26] FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on wikikube-worker1256:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:35:04] cscott, subbu: config patch should be up on test servers [22:35:04] cjming, looks like this is already on the test servers. I can see that it rolled out there [22:35:12] yay! [22:35:20] so gtg? [22:35:21] ok, so yes verified that it is rendering properly there with parsoid. [22:35:25] yes. [22:35:30] nice [22:35:40] !log cjming@deploy2002 cscott, cjming: Backport for [[gerrit:1098572|Turn on Parsoid Read views on jawikivoyage (T380769)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:35:44] !log cjming@deploy2002 cscott, cjming: Continuing with sync [22:35:45] T380769: Deploy Parsoid Read Views to ja wikivoyage (week of 2024-11-27) - https://phabricator.wikimedia.org/T380769 [22:36:40] Hey I can confirm it works [22:36:47] woohoo! [22:36:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate mwmaint.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [22:36:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P71262 and previous config saved to /var/cache/conftool/dbconfig/20241127-223652-ladsgroup.json [22:36:53] thanks! 
[22:39:26] FIRING: [2x] SystemdUnitFailed: prometheus-ethtool-exporter.service on wikikube-worker1256:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:42:40] I updated the village pump notices to let jawikivoyage know that we were able to squeeze in their deploy this week. [22:43:38] (03PS2) 10Gergő Tisza: Fix mobile domain logic for login.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098633 (https://phabricator.wikimedia.org/T380646) [22:44:26] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098572|Turn on Parsoid Read views on jawikivoyage (T380769)]] (duration: 15m 22s) [22:44:30] subbu, cscott: should be live! [22:44:30] T380769: Deploy Parsoid Read Views to ja wikivoyage (week of 2024-11-27) - https://phabricator.wikimedia.org/T380769 [22:45:36] \o/ [22:46:50] !log end of UTC late backport window [22:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:17] I'll deploy a config fix [22:47:51] oh - sorry - prematurely closed the window [22:48:48] Tgr needs to defenestrate something still. [22:49:17] cscott, i am signing off ... back online in 3 hours if there is anything needed. [22:49:34] 👍 [22:49:35] but available on signal. 
[22:49:45] Thanks again cjming [22:49:58] ur welcome :) [22:50:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098633 (https://phabricator.wikimedia.org/T380646) (owner: 10Gergő Tisza) [22:50:47] (03Merged) 10jenkins-bot: Fix mobile domain logic for login.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1098633 (https://phabricator.wikimedia.org/T380646) (owner: 10Gergő Tisza) [22:51:16] !log tgr@deploy2002 Started scap sync-world: Backport for [[gerrit:1098633|Fix mobile domain logic for login.wikimedia.org (T380646)]] [22:51:21] T380646: Centralize SUL2 and SUL3 device detection - https://phabricator.wikimedia.org/T380646 [22:51:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T370903)', diff saved to https://phabricator.wikimedia.org/P71263 and previous config saved to /var/cache/conftool/dbconfig/20241127-225159-ladsgroup.json [22:52:01] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [22:52:04] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [22:52:15] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [22:56:47] !log tgr@deploy2002 tgr: Backport for [[gerrit:1098633|Fix mobile domain logic for login.wikimedia.org (T380646)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:56:51] T380646: Centralize SUL2 and SUL3 device detection - https://phabricator.wikimedia.org/T380646 [23:01:39] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2152.codfw.wmnet with reason: Maintenance [23:01:52] (03PS2) 10DDesouza: Reader Survey: Deploy on multiple wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1098627 (https://phabricator.wikimedia.org/T378660) [23:01:52] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2152.codfw.wmnet with reason: Maintenance [23:02:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T370903)', diff saved to https://phabricator.wikimedia.org/P71264 and previous config saved to /var/cache/conftool/dbconfig/20241127-230159-ladsgroup.json [23:02:04] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [23:02:27] !log tgr@deploy2002 tgr: Continuing with sync [23:09:24] !log tgr@deploy2002 Finished scap sync-world: Backport for [[gerrit:1098633|Fix mobile domain logic for login.wikimedia.org (T380646)]] (duration: 18m 07s) [23:09:26] FIRING: [6x] SystemdUnitFailed: wdqs-updater.service on wdqs2018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:09:28] T380646: Centralize SUL2 and SUL3 device detection - https://phabricator.wikimedia.org/T380646 [23:15:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T370903)', diff saved to https://phabricator.wikimedia.org/P71267 and previous config saved to /var/cache/conftool/dbconfig/20241127-231504-ladsgroup.json [23:15:09] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [23:30:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P71269 and previous config saved to /var/cache/conftool/dbconfig/20241127-233011-ladsgroup.json [23:39:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running 
the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:40:42] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:45:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P71270 and previous config saved to /var/cache/conftool/dbconfig/20241127-234518-ladsgroup.json [23:51:44] (03PS3) 10Tim Starling: Move default main page text for new wikis to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1094126 (https://phabricator.wikimedia.org/T352113) [23:53:07] (03CR) 10Tim Starling: Move default main page text for new wikis to config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1094126 (https://phabricator.wikimedia.org/T352113) (owner: 10Tim Starling)