[00:05:25] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:08:58] (03PS4) 10Aude: Enable Chart extension on testwiki and testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087987 (https://phabricator.wikimedia.org/T378127) [00:16:51] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [00:33:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [00:38:37] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1088000 [00:38:38] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1088000 (owner: 10TrainBranchBot) [00:43:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [00:57:45] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087980 (https://phabricator.wikimedia.org/T379204) (owner: 10Eevans) [01:08:41] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 
10https://gerrit.wikimedia.org/r/1088002 [01:08:41] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1088002 (owner: 10TrainBranchBot) [01:13:26] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1088000 (owner: 10TrainBranchBot) [01:16:51] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [01:22:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [01:23:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [01:28:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [01:36:09] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1088002 (owner: 10TrainBranchBot) [01:43:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% 
#page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [01:45:58] (03CR) 10Krinkle: "As test I would recommend running the steps that Tim did in https://phabricator.wikimedia.org/T292552#8068291." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087604 (https://phabricator.wikimedia.org/T372603) (owner: 10Scott French) [01:46:31] (03CR) 10Krinkle: "Spot check: https://3v4l.org/qef99" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087604 (https://phabricator.wikimedia.org/T372603) (owner: 10Scott French) [01:47:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [01:50:38] PROBLEM - Disk space on thanos-be2003 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdf1 193952 MB (5% inode=91%): /srv/swift-storage/sdd1 183659 MB (4% inode=91%): /srv/swift-storage/sdc1 196684 MB (5% inode=92%): /srv/swift-storage/sdh1 152218 MB (3% inode=90%): /srv/swift-storage/sdi1 196770 MB (5% inode=91%): /srv/swift-storage/sdg1 176975 MB (4% inode=91%): /srv/swift-storage/sdk1 171108 MB (4% inode=91%): /srv/swift-st [01:50:38] j1 184705 MB (4% inode=91%): /srv/swift-storage/sdl1 183003 MB (4% inode=91%): /srv/swift-storage/sde1 154696 MB (4% inode=91%): /srv/swift-storage/sdm1 160314 MB (4% inode=91%): /srv/swift-storage/sdn1 178253 MB (4% inode=92%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2003&var-datasource=codfw+prometheus/ops [01:53:58] PROBLEM - Disk space on thanos-be2004 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdf1 194585 MB (5% inode=92%): 
/srv/swift-storage/sdg1 203329 MB (5% inode=92%): /srv/swift-storage/sdc1 152974 MB (4% inode=90%): /srv/swift-storage/sdh1 186323 MB (4% inode=91%): /srv/swift-storage/sde1 176391 MB (4% inode=91%): /srv/swift-storage/sdd1 162364 MB (4% inode=91%): /srv/swift-storage/sdj1 173683 MB (4% inode=91%): /srv/swift-st [01:53:58] k1 150706 MB (3% inode=90%): /srv/swift-storage/sdi1 181094 MB (4% inode=92%): /srv/swift-storage/sdl1 191221 MB (5% inode=91%): /srv/swift-storage/sdn1 200110 MB (5% inode=91%): /srv/swift-storage/sdm1 197213 MB (5% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2004&var-datasource=codfw+prometheus/ops [01:58:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [02:00:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 11 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078764 (https://phabricator.wikimedia.org/T376061) (owner: 10ZhaoFJx) [02:02:20] PROBLEM - Disk space on thanos-be2002 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdd1 189045 MB (4% inode=91%): /srv/swift-storage/sdc1 159697 MB (4% inode=91%): /srv/swift-storage/sdg1 179146 MB (4% inode=92%): /srv/swift-storage/sdh1 192740 MB (5% inode=91%): /srv/swift-storage/sde1 165054 MB (4% inode=91%): /srv/swift-storage/sdi1 158509 MB (4% inode=90%): /srv/swift-storage/sdj1 179003 MB (4% inode=92%): /srv/swift-st [02:02:20] k1 186716 MB (4% inode=91%): /srv/swift-storage/sdl1 175175 MB (4% inode=91%): /srv/swift-storage/sdm1 186023 MB (4% inode=91%): /srv/swift-storage/sdn1 190351 MB (4% inode=91%): /srv/swift-storage/sdf1 150409 MB (3% inode=90%): 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2002&var-datasource=codfw+prometheus/ops [02:08:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:48:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:19:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:50:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 07 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087914 (https://phabricator.wikimedia.org/T359918) (owner: 10Abijeet Patro) [03:52:15] (03PS3) 10Abijeet Patro: Translate: Enable message bundle Scribunto module on 
testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087914 (https://phabricator.wikimedia.org/T359918) [04:01:42] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:05:25] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:06:42] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:13:58] PROBLEM - Disk space on thanos-be1001 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdf1 200686 MB (5% inode=92%): /srv/swift-storage/sdg1 197294 MB (5% inode=91%): /srv/swift-storage/sdc1 181689 MB (4% inode=91%): /srv/swift-storage/sdi1 176350 MB (4% inode=91%): /srv/swift-storage/sde1 165632 MB (4% inode=91%): /srv/swift-storage/sdh1 163496 MB (4% inode=91%): /srv/swift-storage/sdj1 191196 MB (5% inode=92%): /srv/swift-st [04:13:58] k1 185924 MB (4% inode=91%): /srv/swift-storage/sdd1 152086 MB (3% inode=90%): /srv/swift-storage/sdm1 181974 MB (4% inode=92%): /srv/swift-storage/sdl1 153106 MB (4% inode=90%): /srv/swift-storage/sdn1 173014 MB (4% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1001&var-datasource=eqiad+prometheus/ops [04:17:18] PROBLEM - Disk space on thanos-be1002 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sde1 177062 MB (4% inode=91%): 
/srv/swift-storage/sdc1 152117 MB (3% inode=90%): /srv/swift-storage/sdf1 172239 MB (4% inode=91%): /srv/swift-storage/sdd1 185475 MB (4% inode=91%): /srv/swift-storage/sdg1 169846 MB (4% inode=91%): /srv/swift-storage/sdh1 173808 MB (4% inode=91%): /srv/swift-storage/sdi1 203061 MB (5% inode=92%): /srv/swift-st [04:17:18] j1 178705 MB (4% inode=92%): /srv/swift-storage/sdk1 153440 MB (4% inode=91%): /srv/swift-storage/sdm1 159192 MB (4% inode=91%): /srv/swift-storage/sdn1 164621 MB (4% inode=91%): /srv/swift-storage/sdl1 153103 MB (4% inode=90%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1002&var-datasource=eqiad+prometheus/ops [05:43:20] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:54:04] PROBLEM - Disk space on thanos-be2001 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdg1 183039 MB (4% inode=92%): /srv/swift-storage/sdd1 189440 MB (4% inode=91%): /srv/swift-storage/sdc1 174622 MB (4% inode=91%): /srv/swift-storage/sdf1 171883 MB (4% inode=91%): /srv/swift-storage/sdh1 152913 MB (4% inode=90%): /srv/swift-storage/sdi1 152176 MB (3% inode=90%): /srv/swift-storage/sde1 187921 MB (4% inode=92%): /srv/swift-st [05:54:04] j1 196198 MB (5% inode=91%): /srv/swift-storage/sdk1 176974 MB (4% inode=91%): /srv/swift-storage/sdm1 160273 MB (4% inode=91%): /srv/swift-storage/sdl1 181927 MB (4% inode=92%): /srv/swift-storage/sdn1 166871 MB (4% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2001&var-datasource=codfw+prometheus/ops [06:20:22] (03PS1) 10KartikMistry: Update MinT to 2024-10-16-065051-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088148 [06:37:18] quick deploy for MinT. 
[06:37:44] (03CR) 10KartikMistry: [C:03+2] Update MinT to 2024-10-16-065051-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088148 (owner: 10KartikMistry) [06:39:00] (03Merged) 10jenkins-bot: Update MinT to 2024-10-16-065051-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088148 (owner: 10KartikMistry) [06:39:26] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [06:44:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [06:44:39] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [06:47:16] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [06:49:21] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [06:55:17] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241107T0700) [07:00:05] marostegui, Amir1, and arnaudb: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241107T0700). 
[07:03:31] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [07:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:10:40] PROBLEM - Disk space on thanos-be1004 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sde1 176102 MB (4% inode=91%): /srv/swift-storage/sdc1 155725 MB (4% inode=90%): /srv/swift-storage/sdh1 154897 MB (4% inode=90%): /srv/swift-storage/sdd1 151849 MB (3% inode=90%): /srv/swift-storage/sdf1 167554 MB (4% inode=91%): /srv/swift-storage/sdg1 201451 MB (5% inode=92%): /srv/swift-storage/sdi1 165478 MB (4% inode=91%): /srv/swift-st [07:10:40] j1 172971 MB (4% inode=92%): /srv/swift-storage/sdl1 179130 MB (4% inode=92%): /srv/swift-storage/sdk1 175702 MB (4% inode=91%): /srv/swift-storage/sdm1 185465 MB (4% inode=91%): /srv/swift-storage/sdn1 179973 MB (4% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1004&var-datasource=eqiad+prometheus/ops [07:18:22] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [07:19:09] (03PS1) 10Ryan Kemper: wdqs: remove 5 codfw hosts from production [puppet] - 10https://gerrit.wikimedia.org/r/1088185 (https://phabricator.wikimedia.org/T376150) [07:19:25] RESOLVED: 
SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:21:05] (03PS1) 10Ryan Kemper: [WIP] create wdqs-internal-main role [puppet] - 10https://gerrit.wikimedia.org/r/1088210 [07:22:34] (03CR) 10Ryan Kemper: [WIP] create wdqs-internal-main role (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (owner: 10Ryan Kemper) [07:23:25] ah. Seems MinT deployment failed for eqiad. Logs aren't useful. [07:23:35] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10298483 (10ayounsi) >>! In T364092#10296958, @akosiaris wrote: >> Upgrades should follow the standard process > > The standard process docs are outdated I fear. > >> Depool site... [07:25:02] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:25:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1045.eqiad.wmnet to cluster eqiad and group B [07:25:11] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1045.eqiad.wmnet to cluster eqiad and group B [07:25:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1045.eqiad.wmnet to cluster eqiad and group C [07:27:22] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1045.eqiad.wmnet to cluster eqiad and group C [07:27:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1046.eqiad.wmnet to cluster eqiad and group C [07:28:56] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1046.eqiad.wmnet to cluster eqiad and group C 
[07:31:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [07:38:58] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:39:18] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10298486 (10MoritzMuehlenhoff) [07:49:57] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled [07:50:01] !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1 day, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled [07:50:02] T378068: pc1017 crashed - https://phabricator.wikimedia.org/T378068 [07:50:12] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled [07:50:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on pc1017.eqiad.wmnet with reason: T378068, host is not pooled [07:51:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [07:56:51] hello [07:57:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch 
more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [08:00:05] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241107T0800). nyaa~ [08:00:05] abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:01:59] holda [08:02:01] hold [08:02:06] typos!! [08:03:12] abijeet: around? [08:03:23] yea [08:04:33] Let's deploy. [08:04:58] (03PS1) 10KartikMistry: Enable Section Translation in ann, iba, nr and, tdd Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088215 (https://phabricator.wikimedia.org/T371420) [08:05:25] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:05:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087914 (https://phabricator.wikimedia.org/T359918) (owner: 10Abijeet Patro) [08:05:42] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2155.codfw.wmnet with reason: Maintenance [08:05:56] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2155.codfw.wmnet with reason: Maintenance [08:05:58] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance [08:06:11] !log 
arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance [08:06:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T367781)', diff saved to https://phabricator.wikimedia.org/P70977 and previous config saved to /var/cache/conftool/dbconfig/20241107-080618-arnaudb.json [08:06:21] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [08:06:22] (Merged) jenkins-bot: Translate: Enable message bundle Scribunto module on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1087914 (https://phabricator.wikimedia.org/T359918) (owner: Abijeet Patro) [08:06:34] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1088215 (https://phabricator.wikimedia.org/T371420) (owner: KartikMistry) [08:07:30] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1087914|Translate: Enable message bundle Scribunto module on testwiki (T359918)]] [08:07:33] T359918: Lua interface for convenient access to translations in a message bundle - https://phabricator.wikimedia.org/T359918 [08:11:57] eh. Wikitech login issues :/ [08:13:08] kart_: have you logged in since the SUL migration? [08:13:09] !log kartik@deploy2002 kartik, abi: Backport for [[gerrit:1087914|Translate: Enable message bundle Scribunto module on testwiki (T359918)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:13:12] T359918: Lua interface for convenient access to translations in a message bundle - https://phabricator.wikimedia.org/T359918 [08:13:56] abijeet: Please test! [08:14:12] kart_, ok [08:14:18] RhinosF1: it seems. Now, I can't log in using any two-factor. Reading more on that later.. 
[08:16:04] kart_: see https://wikitech.wikimedia.org/wiki/Wikitech/SUL-migration#What_You_Should_Do when you have time [08:17:03] Thanks! [08:18:27] kart_, looks ok! [08:19:14] abijeet: nice. deploying! [08:19:19] !log kartik@deploy2002 kartik, abi: Continuing with sync [08:20:29] (CR) Muehlenhoff: [C:+2] Remove incorreclty used system::role [puppet] - https://gerrit.wikimedia.org/r/1083783 (owner: Muehlenhoff) [08:21:50] It seems mwdebug2001/2002 were not responding as per scap. [08:22:01] (PS5) Brouberol: airflow: render the spark/hadoop/hdfs/yarn configuration files [deployment-charts] - https://gerrit.wikimedia.org/r/1087903 (https://phabricator.wikimedia.org/T377928) [08:22:23] 08:12:07 ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', '--exclude-wikiversions.php', 'deploy1003.eqiad.wmnet', 'deploy2002.codfw.wmnet', 'deploy2002.codfw.wmnet'] (ran as mwdeploy@mwdebug2002.codfw.wmnet) returned [255]: Timeout, server mwdebug2002.codfw.wmnet not responding. [08:22:23] 08:12:07 ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', '--exclude-wikiversions.php', 'deploy1003.eqiad.wmnet', 'deploy2002.codfw.wmnet', 'deploy2002.codfw.wmnet'] (ran as mwdeploy@mwdebug2001.codfw.wmnet) returned [255]: Timeout, server mwdebug2001.codfw.wmnet not responding. [08:22:44] (CR) Brouberol: "I confirm that even with all the removed environment variables, I'm still able to successfully run" [deployment-charts] - https://gerrit.wikimedia.org/r/1087903 (https://phabricator.wikimedia.org/T377928) (owner: Brouberol) [08:22:52] <_joe_> kart_: uh [08:23:21] <_joe_> kart_: the server seems up and happy [08:23:30] <_joe_> mwdebug2001/2002 [08:23:37] sad scap :/ [08:24:07] <_joe_> kart_: not sure why mwdebug would appear *after* the sync to testservers [08:24:56] My bad. 
[08:24:56] 08:12:07 sync-testservers: 100% (in-flight: 0; ok: 2; fail: 2; left: 0) [08:24:56] 08:12:07 Per-host sync duration: average 26.9s, median 27.7s [08:24:56] 08:12:07 rsync transfer: average 182,383 bytes/host, total 729,532 bytes [08:24:56] 08:12:07 2 testservers had sync errors [08:25:06] <_joe_> oh ok [08:25:10] _joe_: ^ logs after that [08:25:14] <_joe_> so it was during the test server sync [08:25:16] <_joe_> ok [08:25:24] <_joe_> kart_: it's not a blocker, btw [08:25:42] yes, abijeet was able to test though. [08:25:53] <_joe_> !log running scap pull on mwdebug2001/2002 [08:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:09] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1087914|Translate: Enable message bundle Scribunto module on testwiki (T359918)]] (duration: 18m 39s) [08:26:17] T359918: Lua interface for convenient access to translations in a message bundle - https://phabricator.wikimedia.org/T359918 [08:26:43] (CR) Brouberol: airflow: render the spark/hadoop/hdfs/yarn configuration files (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1087903 (https://phabricator.wikimedia.org/T377928) (owner: Brouberol) [08:26:48] abijeet: can you please test again without mwdebug if everything is fine? 
[08:27:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1007:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [08:27:42] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2081.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:27:44] FIRING: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [08:27:52] kart_, ok [08:28:00] <_joe_> uh [08:28:30] <_joe_> looks like more of the same we had yesterday? [08:28:43] kart_, looks good [08:29:53] hmmm cache_text and full poolcounter queues? 
[08:30:17] <_joe_> akosiaris: exact same thing as yesterday I'd say [08:30:26] <_joe_> my requestctl rule didn't cover everything I guess [08:30:50] looks like it's going to send recoveries soon [08:31:22] <_joe_> yes [08:31:30] <_joe_> dunno why the requestctl rule didn't kick in [08:32:13] PROBLEM - Host ms-be2083 is DOWN: PING CRITICAL - Packet loss = 100% [08:32:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1007:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [08:32:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [08:32:57] RECOVERY - Host ms-be2083 is UP: PING OK - Packet loss = 0%, RTA = 30.26 ms [08:34:49] ms-be2083 flapping is me testing uefi [08:36:23] PROBLEM - Host ms-be2083 is DOWN: PING CRITICAL - Packet loss = 100% [08:37:41] RECOVERY - Host ms-be2083 is UP: PING OK - Packet loss = 0%, RTA = 30.37 ms [08:38:30] (PS1) Muehlenhoff: kafka::certificate: Remove non-PKI code paths [puppet] - https://gerrit.wikimedia.org/r/1088222 (https://phabricator.wikimedia.org/T337825) [08:40:44] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2081.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [08:41:24] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ms-be2081.codfw.wmnet with OS bullseye [08:48:04] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1088222 (https://phabricator.wikimedia.org/T337825) (owner: Muehlenhoff) [08:50:20] SRE, 
LDAP-Access-Requests: Grant Access to analytics-privatedata-users, wmf for - https://phabricator.wikimedia.org/T379225#10298651 (Arrbee) Also endorsing an approval for this request as Purity's manager. [08:52:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [08:54:14] (PS1) Slyngshede: Permission list: remove debugging info. [software/bitu] - https://gerrit.wikimedia.org/r/1088227 [08:56:45] (PS2) Slyngshede: Permission list: remove debugging info. [software/bitu] - https://gerrit.wikimedia.org/r/1088227 [08:59:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [09:00:05] jnuche and dduvall: That opportune time for a MediaWiki train - Utc-0+Utc-7 Version deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241107T0900). [09:00:28] hi, train rolling forward in ~5m [09:00:36] (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1088227 (owner: Slyngshede) [09:01:21] (CR) Slyngshede: [C:+2] Permission list: remove debugging info. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1088227 (owner: 10Slyngshede) [09:02:17] 06SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users, wmf for Pwaigi - https://phabricator.wikimedia.org/T379225#10298673 (10KCVelaga_WMF) [09:03:40] (03Merged) 10jenkins-bot: Permission list: remove debugging info. [software/bitu] - 10https://gerrit.wikimedia.org/r/1088227 (owner: 10Slyngshede) [09:06:07] (03PS1) 10TrainBranchBot: group2 to 1.44.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088228 (https://phabricator.wikimedia.org/T375661) [09:06:09] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.44.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088228 (https://phabricator.wikimedia.org/T375661) (owner: 10TrainBranchBot) [09:06:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T367781)', diff saved to https://phabricator.wikimedia.org/P70978 and previous config saved to /var/cache/conftool/dbconfig/20241107-090643-arnaudb.json [09:06:47] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [09:06:51] (03Merged) 10jenkins-bot: group2 to 1.44.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088228 (https://phabricator.wikimedia.org/T375661) (owner: 10TrainBranchBot) [09:10:45] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1087531 (https://phabricator.wikimedia.org/T375479) (owner: 10FNegri) [09:14:08] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.2 refs T375661 [09:14:11] T375661: 1.44.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T375661 [09:14:27] (03CR) 10Muehlenhoff: "PCC error can be ignored, the nodes in deployment-prep were already failing before" [puppet] - 10https://gerrit.wikimedia.org/r/1088222 (https://phabricator.wikimedia.org/T337825) (owner: 10Muehlenhoff) [09:21:26] !log uploaded openjdk-8 8u412-ga-1~deb11u1 to 
apt.wikimedia.org for bookworm-wikimedia [09:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:39] !log installing openjdk-8 security updates [09:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P70979 and previous config saved to /var/cache/conftool/dbconfig/20241107-092150-arnaudb.json [09:29:10] !log upload liberica 0.4 to apt.wm.o (bookworm-wikimedia) [09:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:55] (03PS3) 10Vgutierrez: liberica: Harden healthcheck systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1087939 (https://phabricator.wikimedia.org/T378341) [09:32:56] (03PS3) 10Vgutierrez: liberica: Harden cp systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1087941 (https://phabricator.wikimedia.org/T378341) [09:32:57] (03PS2) 10Vgutierrez: liberica: Harden fp systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1087944 (https://phabricator.wikimedia.org/T378341) [09:33:44] (03CR) 10Vgutierrez: [C:03+2] liberica: Harden healthcheck systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1087939 (https://phabricator.wikimedia.org/T378341) (owner: 10Vgutierrez) [09:33:45] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: SystemdUnitFailed (instance ganeti-test2003:9100) - https://phabricator.wikimedia.org/T379233 (10LSobanski) 03NEW [09:35:19] (03CR) 10Vgutierrez: [C:03+2] liberica: Harden cp systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1087941 (https://phabricator.wikimedia.org/T378341) (owner: 10Vgutierrez) [09:36:54] (03CR) 10Elukey: [V:03+1] tlsproxy::localssl: allow multiple listens for tls ports (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1087421 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [09:36:58] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P70980 and previous config saved to /var/cache/conftool/dbconfig/20241107-093657-arnaudb.json [09:38:43] (03CR) 10Vgutierrez: [C:03+2] liberica: Harden fp systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1087944 (https://phabricator.wikimedia.org/T378341) (owner: 10Vgutierrez) [09:41:02] (03CR) 10Btullis: airflow: render the spark/hadoop/hdfs/yarn configuration files (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087903 (https://phabricator.wikimedia.org/T377928) (owner: 10Brouberol) [09:41:02] (03CR) 10Btullis: [C:03+1] airflow: render the spark/hadoop/hdfs/yarn configuration files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087903 (https://phabricator.wikimedia.org/T377928) (owner: 10Brouberol) [09:41:25] (03PS1) 10MVernon: admin: add Deepesha Burse to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1088239 (https://phabricator.wikimedia.org/T378182) [09:41:51] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2081.codfw.wmnet with OS bullseye [09:43:33] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [09:48:09] (03CR) 10Vgutierrez: kafka::certificate: Remove non-PKI code paths (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088222 (https://phabricator.wikimedia.org/T337825) (owner: 10Muehlenhoff) [09:48:35] (03PS2) 10Fabfur: hiera: moving haproxykafka common keys to profile [puppet] - 10https://gerrit.wikimedia.org/r/1087943 (https://phabricator.wikimedia.org/T377931) [09:50:10] (03PS1) 10Fabfur: hiera: do not install haproxykafka on cloud instances [puppet] - 10https://gerrit.wikimedia.org/r/1088244 (https://phabricator.wikimedia.org/T370668) [09:50:28] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087943 
(https://phabricator.wikimedia.org/T377931) (owner: 10Fabfur) [09:51:28] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1009.eqiad.wmnet [09:51:42] (03CR) 10Joal: [C:03+1] refinery: gobblin: add webrequest_frontend. [puppet] - 10https://gerrit.wikimedia.org/r/1082434 (https://phabricator.wikimedia.org/T377931) (owner: 10Gmodena) [09:51:47] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10298810 (10ops-monitoring-bot) Draining ganeti1009.eqiad.wmnet of running VMs [09:52:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T367781)', diff saved to https://phabricator.wikimedia.org/P70981 and previous config saved to /var/cache/conftool/dbconfig/20241107-095205-arnaudb.json [09:52:08] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [09:52:49] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10298816 (10MoritzMuehlenhoff) [09:55:33] (03PS1) 10Muehlenhoff: Add ganeti1047/1048 as Ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/1088246 (https://phabricator.wikimedia.org/T378921) [09:55:37] (03PS1) 10Giuseppe Lavagetto: Release new version [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1088247 [09:55:50] (03CR) 10Vgutierrez: [C:03+1] kafka::certificate: Remove non-PKI code paths (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088222 (https://phabricator.wikimedia.org/T337825) (owner: 10Muehlenhoff) [09:56:13] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Release new version [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1088247 (owner: 10Giuseppe Lavagetto) [09:57:42] !log oblivian@cumin2002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with 
reason: "Add rw interface (still disabled), search - oblivian@cumin2002" [09:57:47] !log oblivian@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Add rw interface (still disabled), search - oblivian@cumin2002 [09:58:21] !log oblivian@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Add rw interface (still disabled), search - oblivian@cumin2002 [09:58:23] !log oblivian@cumin2002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Add rw interface (still disabled), search - oblivian@cumin2002" [10:00:01] (03CR) 10Vgutierrez: [C:04-1] "this is not enough, please see https://puppet-compiler.wmflabs.org/output/1088244/4463/, you need to provide all the hiera keys used on pr" [puppet] - 10https://gerrit.wikimedia.org/r/1088244 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [10:02:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1009.eqiad.wmnet [10:05:56] (03PS1) 10Elukey: sre.hosts.reimage: disable UEFI boot override after d-i [cookbooks] - 10https://gerrit.wikimedia.org/r/1088252 [10:07:30] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2081.codfw.wmnet with OS bullseye [10:08:33] (03CR) 10Elukey: [C:03+1] kafka::certificate: Remove non-PKI code paths [puppet] - 10https://gerrit.wikimedia.org/r/1088222 (https://phabricator.wikimedia.org/T337825) (owner: 10Muehlenhoff) [10:13:03] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1088239 (https://phabricator.wikimedia.org/T378182) (owner: 10MVernon) [10:16:45] 06SRE, 10SRE-Access-Requests: Requesting access to deployment shell access for Kgraessle - https://phabricator.wikimedia.org/T379173#10298867 (10MatthewVernon) @thcipriani can you approve this request to join the deployment shell 
group please? [10:17:22] 06SRE, 10SRE-Access-Requests: Requesting access to deployment shell access for Kgraessle - https://phabricator.wikimedia.org/T379173#10298868 (10MatthewVernon) [10:18:33] !log elukey@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2081.codfw.wmnet with reason: host reimage [10:18:41] (03CR) 10MVernon: [C:03+2] admin: add Deepesha Burse to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/1088239 (https://phabricator.wikimedia.org/T378182) (owner: 10MVernon) [10:20:34] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [10:20:40] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [10:21:27] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2081.codfw.wmnet with reason: host reimage [10:29:00] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/nda for Deepesha Burse WMDE - https://phabricator.wikimedia.org/T378182#10298873 (10MatthewVernon) 05Stalled→03Resolved a:03MatthewVernon Hi @Deepesha_WMDE this is all done for you now. [10:33:24] !log jmm@cumin2002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-test-eqiad [10:37:00] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: BFD won't esablish between QFX in VRF and host from IPv6 link-local - https://phabricator.wikimedia.org/T374379#10298883 (10ayounsi) If it's a bug on the switch it's probably worth opening a JTAC ticket. Even if it's not fixed on time for u... 
[10:40:02] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [10:40:09] !log gmodena@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-dump-rev-content-reconcile-enrich: apply [10:40:13] (03PS1) 10MVernon: admin: add no-krb,no-ssh analytics-privatedate-users pwaigi1- [puppet] - 10https://gerrit.wikimedia.org/r/1088256 (https://phabricator.wikimedia.org/T379225) [10:40:23] !log elukey@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin2002" [10:41:03] !log elukey@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin2002" [10:41:04] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2081.codfw.wmnet with OS bullseye [10:41:04] (03CR) 10CI reject: [V:04-1] admin: add no-krb,no-ssh analytics-privatedate-users pwaigi1- [puppet] - 10https://gerrit.wikimedia.org/r/1088256 (https://phabricator.wikimedia.org/T379225) (owner: 10MVernon) [10:46:34] Lucas_WMDE: We will have to do the schema change for https://phabricator.wikimedia.org/T367856 on a normal basis - our test didn't work. So it will affect all s8 wikireplicas which may get around 8-10 days of lag. I am not starting it tomorrow as I am reverting the change on another host and it will take around 2 days to complete. 
So probably it will be started on monday [10:46:44] (03PS2) 10MVernon: admin: add pwaigi1- to analytics-privatedate-users [puppet] - 10https://gerrit.wikimedia.org/r/1088256 (https://phabricator.wikimedia.org/T379225) [10:47:46] (03Abandoned) 10MVernon: admin: add pwaigi1- to analytics-privatedate-users [puppet] - 10https://gerrit.wikimedia.org/r/1088256 (https://phabricator.wikimedia.org/T379225) (owner: 10MVernon) [10:48:34] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2082.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:51:38] (03PS3) 10Slyngshede: P:netbox remove CAS authentication leftovers. [puppet] - 10https://gerrit.wikimedia.org/r/1087931 (https://phabricator.wikimedia.org/T371892) [10:54:24] (03CR) 10CI reject: [V:04-1] P:netbox remove CAS authentication leftovers. [puppet] - 10https://gerrit.wikimedia.org/r/1087931 (https://phabricator.wikimedia.org/T371892) (owner: 10Slyngshede) [10:55:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-test-eqiad [10:56:31] (03PS1) 10Santiago Faci: MPIC: Deploying v0.3.0 on staging environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088261 (https://phabricator.wikimedia.org/T369912) [10:57:04] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to analytics-privatedata-users, wmf for Pwaigi - https://phabricator.wikimedia.org/T379225#10298895 (10MatthewVernon) Hi @PWaigi-WMF you're already in both the analytics-privatedata-users group and the wmf LDAP group, after ticket T315257 (in 20... 
[10:57:46] (03PS1) 10Santiago Faci: MPIC: Deploying v0.3.0 on production environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088262 (https://phabricator.wikimedia.org/T369912) [10:58:58] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2082.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241107T1100) [11:01:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1010.eqiad.wmnet [11:02:08] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10298924 (10ops-monitoring-bot) Draining ganeti1010.eqiad.wmnet of running VMs [11:02:09] (03CR) 10Sg912: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088262 (https://phabricator.wikimedia.org/T369912) (owner: 10Santiago Faci) [11:03:12] !log depool liberica on lvs1013 [11:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1010.eqiad.wmnet [11:05:14] PROBLEM - BGP status on lsw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:09:22] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[11:09:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1010.eqiad.wmnet [11:09:51] (03PS1) 10Vgutierrez: liberica: Bind prometheus endpoint to the main NIC [puppet] - 10https://gerrit.wikimedia.org/r/1088263 [11:09:54] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10298937 (10ops-monitoring-bot) Draining ganeti1010.eqiad.wmnet of running VMs [11:10:14] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1088263 (owner: 10Vgutierrez) [11:10:16] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [11:10:16] RECOVERY - BGP status on lsw1-e1-eqiad.mgmt is OK: BGP OK - up: 9, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:10:28] (03CR) 10CI reject: [V:04-1] liberica: Bind prometheus endpoint to the main NIC [puppet] - 10https://gerrit.wikimedia.org/r/1088263 (owner: 10Vgutierrez) [11:11:07] (03CR) 10Ayounsi: [C:03+1] sre.hosts.reimage: disable UEFI boot override after d-i [cookbooks] - 10https://gerrit.wikimedia.org/r/1088252 (owner: 10Elukey) [11:11:12] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[11:12:06] (03PS2) 10Vgutierrez: liberica: Bind prometheus endpoint to the main NIC [puppet] - 10https://gerrit.wikimedia.org/r/1088263 [11:13:11] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1088263 (owner: 10Vgutierrez) [11:15:23] (03CR) 10Santiago Faci: [C:03+2] MPIC: Deploying v0.3.0 on staging environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088261 (https://phabricator.wikimedia.org/T369912) (owner: 10Santiago Faci) [11:16:23] (03Merged) 10jenkins-bot: MPIC: Deploying v0.3.0 on staging environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088261 (https://phabricator.wikimedia.org/T369912) (owner: 10Santiago Faci) [11:16:51] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [11:16:59] (03CR) 10Ayounsi: [C:03+1] liberica: Bind prometheus endpoint to the main NIC [puppet] - 10https://gerrit.wikimedia.org/r/1088263 (owner: 10Vgutierrez) [11:17:17] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [11:17:39] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [11:17:45] (03CR) 10Vgutierrez: [C:03+2] liberica: Bind prometheus endpoint to the main NIC [puppet] - 10https://gerrit.wikimedia.org/r/1088263 (owner: 10Vgutierrez) [11:17:50] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [11:17:59] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [11:18:58] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[11:19:30] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [11:19:51] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [11:19:59] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [11:20:46] (03PS1) 10Vgutierrez: liberica: Bind gobgpd 179 to the main NIC [puppet] - 10https://gerrit.wikimedia.org/r/1088266 [11:21:26] (03PS2) 10Vgutierrez: liberica: Bind gobgpd 179 to the main NIC [puppet] - 10https://gerrit.wikimedia.org/r/1088266 [11:21:27] (03CR) 10CI reject: [V:04-1] liberica: Bind gobgpd 179 to the main NIC [puppet] - 10https://gerrit.wikimedia.org/r/1088266 (owner: 10Vgutierrez) [11:22:22] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1088266 (owner: 10Vgutierrez) [11:23:43] (03PS6) 10Arturo Borrero Gonzalez: openstack: designate: deploy and enable wmcs-nova-fixed-ptr [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) [11:24:04] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/proton: sync [11:24:12] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) (owner: 10Arturo Borrero Gonzalez) [11:24:13] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: sync [11:25:07] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/proton: sync [11:25:58] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install mc-gp100[4-6] - https://phabricator.wikimedia.org/T377032#10298964 (10jijiki) a:05Clement_Goubert→03jijiki [11:26:13] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/proton: sync [11:26:36] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/proton: sync [11:26:45] (03PS1) 10Muehlenhoff: Bump 
version to match package version of latest sec release for Java 8 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1088267 [11:27:37] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/proton: sync [11:29:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to codfw) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [11:30:44] 06SRE, 10conftool, 06Data-Persistence, 06Infrastructure-Foundations: Integrate dbctl IP changes as part of VLAN changes. - https://phabricator.wikimedia.org/T360029#10298979 (10Marostegui) >>! In T360029#10246685, @Volans wrote: > Now that we have dbctl support in Spicerack it should be doable to add the s... [11:32:33] (03PS2) 10Fabfur: hiera: do not install haproxykafka on cloud instances [puppet] - 10https://gerrit.wikimedia.org/r/1088244 (https://phabricator.wikimedia.org/T370668) [11:33:00] (03CR) 10Fabfur: "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1088244 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [11:33:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10298997 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff >>! In T365650#10297670, @Jclark-ctr wrote: > @MoritzMuehlenhoff these whe... 
[11:35:54] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087943 (https://phabricator.wikimedia.org/T377931) (owner: 10Fabfur) [11:36:32] (03PS1) 10Vgutierrez: liberica: gobgpd hardening [puppet] - 10https://gerrit.wikimedia.org/r/1088268 [11:36:53] (03CR) 10Santiago Faci: [C:03+2] MPIC: Deploying v0.3.0 on production environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088262 (https://phabricator.wikimedia.org/T369912) (owner: 10Santiago Faci) [11:37:15] (03CR) 10CI reject: [V:04-1] liberica: gobgpd hardening [puppet] - 10https://gerrit.wikimedia.org/r/1088268 (owner: 10Vgutierrez) [11:37:51] (03Merged) 10jenkins-bot: MPIC: Deploying v0.3.0 on production environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088262 (https://phabricator.wikimedia.org/T369912) (owner: 10Santiago Faci) [11:38:04] (03PS2) 10Vgutierrez: liberica: gobgpd hardening [puppet] - 10https://gerrit.wikimedia.org/r/1088268 [11:38:57] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1088268 (owner: 10Vgutierrez) [11:43:01] (03CR) 10Muehlenhoff: liberica: gobgpd hardening (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088268 (owner: 10Vgutierrez) [11:43:07] (03PS7) 10Arturo Borrero Gonzalez: openstack: designate: deploy and enable wmcs-nova-fixed-ptr [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) [11:43:13] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) (owner: 10Arturo Borrero Gonzalez) [11:43:49] (03PS1) 10Jgiannelos: chromium-render: Add cli flag to avoid flooding with crashpad processes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088271 (https://phabricator.wikimedia.org/T376438) [11:44:08] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [11:44:23] !log sfaci@deploy2002 
helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [11:45:43] (03CR) 10David Caro: "That should be available already from the secrets repo." [puppet] - 10https://gerrit.wikimedia.org/r/1087968 (https://phabricator.wikimedia.org/T360626) (owner: 10Raymond Ndibe) [11:48:08] (03CR) 10Vgutierrez: liberica: gobgpd hardening (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088268 (owner: 10Vgutierrez) [11:49:21] (03PS8) 10Arturo Borrero Gonzalez: openstack: designate: deploy and enable wmcs-nova-fixed-ptr [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) [11:49:27] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) (owner: 10Arturo Borrero Gonzalez) [11:49:48] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4464/co" [puppet] - 10https://gerrit.wikimedia.org/r/1087968 (https://phabricator.wikimedia.org/T360626) (owner: 10Raymond Ndibe) [11:50:00] 10ops-eqiad, 06SRE, 06Data-Persistence-SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10299063 (10VRiley-WMF) I have opened an SR with Dell 200579927 [11:51:02] (03PS4) 10Gmodena: refinery: gobblin: add webrequest_frontend. [puppet] - 10https://gerrit.wikimedia.org/r/1082434 (https://phabricator.wikimedia.org/T377931) [11:51:17] 10ops-eqiad, 06SRE, 06Data-Persistence-SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10299070 (10Marostegui) Thank you, we can provide the timing from the previous tickets if that's needed. As it is a recurrent HW error. [11:52:06] (03CR) 10Gmodena: refinery: gobblin: add webrequest_frontend. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1082434 (https://phabricator.wikimedia.org/T377931) (owner: 10Gmodena) [11:52:41] (03PS3) 10Vgutierrez: liberica: gobgpd hardening [puppet] - 10https://gerrit.wikimedia.org/r/1088268 (https://phabricator.wikimedia.org/T378341) [11:53:04] (03CR) 10Fabfur: [C:03+1] liberica: Bind gobgpd 179 to the main NIC [puppet] - 10https://gerrit.wikimedia.org/r/1088266 (owner: 10Vgutierrez) [11:53:20] (03CR) 10Vgutierrez: [C:03+2] liberica: Bind gobgpd 179 to the main NIC [puppet] - 10https://gerrit.wikimedia.org/r/1088266 (owner: 10Vgutierrez) [11:53:28] 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 10observability, and 3 others: Upgrade Kafka to from 1.x to later version - https://phabricator.wikimedia.org/T300102#10299074 (10jijiki) [11:57:21] 10SRE-swift-storage, 06serviceops-radar: thanos-be hosts filing up root filesystem with logs - https://phabricator.wikimedia.org/T297959#10299086 (10jijiki) [12:00:11] (03PS9) 10Arturo Borrero Gonzalez: openstack: designate: deploy and enable wmcs-nova-fixed-ptr [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) [12:00:24] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) (owner: 10Arturo Borrero Gonzalez) [12:02:12] (03PS1) 10Muehlenhoff: Blacklist nilfs2 [puppet] - 10https://gerrit.wikimedia.org/r/1088273 [12:05:12] 06SRE, 10SRE-Access-Requests: Requesting access to deployment shell access for Kgraessle - https://phabricator.wikimedia.org/T379173#10299118 (10SCherukuwada) Katherine's manager is on leave, skip-level here approving. 
[12:05:25] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:05:42] (03PS10) 10Arturo Borrero Gonzalez: openstack: designate: deploy and enable wmcs-nova-fixed-ptr [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) [12:06:40] (03CR) 10Muehlenhoff: [C:03+2] Add ganeti1047/1048 as Ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/1088246 (https://phabricator.wikimedia.org/T378921) (owner: 10Muehlenhoff) [12:06:41] (03PS11) 10Arturo Borrero Gonzalez: openstack: designate: deploy and enable wmcs-nova-fixed-ptr [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) [12:06:50] (03PS1) 10Gmodena: dse-k8s-services: mw-dump: version bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088275 (https://phabricator.wikimedia.org/T368746) [12:07:03] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) (owner: 10Arturo Borrero Gonzalez) [12:07:25] (03CR) 10Vgutierrez: [C:03+2] liberica: gobgpd hardening [puppet] - 10https://gerrit.wikimedia.org/r/1088268 (https://phabricator.wikimedia.org/T378341) (owner: 10Vgutierrez) [12:10:13] (03PS1) 10KartikMistry: Update recommendation-api to 2024-11-06-190017-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088276 (https://phabricator.wikimedia.org/T374597) [12:10:20] (03PS12) 10Arturo Borrero Gonzalez: openstack: designate: deploy and enable wmcs-nova-fixed-ptr [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) [12:11:49] (03CR) 10Ayounsi: [C:03+1] "Thinking about it now, we also need to allow IPv6." 
[puppet] - 10https://gerrit.wikimedia.org/r/1088263 (owner: 10Vgutierrez) [12:12:26] (03CR) 10Ayounsi: [C:03+1] liberica: Bind gobgpd 179 to the main NIC [puppet] - 10https://gerrit.wikimedia.org/r/1088266 (owner: 10Vgutierrez) [12:12:52] (03PS2) 10KartikMistry: Update recommendation-api to 2024-11-06-190017-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088276 (https://phabricator.wikimedia.org/T374597) [12:13:07] (03PS2) 10Clément Goubert: external_clouds_vendors: Use requestctl apply [puppet] - 10https://gerrit.wikimedia.org/r/1088274 [12:13:16] PROBLEM - BGP status on lsw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:14:16] RECOVERY - BGP status on lsw1-e1-eqiad.mgmt is OK: BGP OK - up: 9, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:14:48] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1088278 (owner: 10L10n-bot) [12:15:51] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) (owner: 10Arturo Borrero Gonzalez) [12:16:00] !log repool liberica on lvs1013 [12:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:45] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088280 [12:24:20] marostegui: thanks! I’ve added a note at https://www.wikidata.org/w/index.php?title=Wikidata:Status_updates/Next&diff=prev&oldid=2271340171, feel free to change it if I got something wrong [12:24:35] (I realized that just putting it on the wiki page makes it easier for you to see or edit the announcement than if I copy+paste it here ^^) [12:24:50] Lucas_WMDE: That's perfect, thank you! 
[12:32:27] 10ops-eqiad, 06SRE, 06Data-Persistence-SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10299241 (10VRiley-WMF) Yes! That would be most helpful! [12:34:47] 06SRE, 07SRE-Unowned, 06serviceops-radar, 10wikitech.wikimedia.org: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10299227 (10jijiki) [12:36:12] !log joal@deploy2002 Started deploy [analytics/refinery@4bec064]: Regular analytics weekly train [analytics/refinery@4bec0640] [12:37:40] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1047 [12:38:47] (03CR) 10Hnowlan: "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088271 (https://phabricator.wikimedia.org/T376438) (owner: 10Jgiannelos) [12:39:19] !log jmm@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host ganeti1047 [12:40:18] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1047 [12:40:36] !log jmm@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host ganeti1047 [12:42:05] 10ops-eqiad, 06SRE, 06Data-Persistence-SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10299291 (10Marostegui) >>! In T374215#10299241, @VRiley-WMF wrote: > Yes! That would be most helpful! Sure this is what we had: * Feb 2024 server got into pr... 
[12:43:58] (03PS13) 10Arturo Borrero Gonzalez: openstack: designate: deploy and enable wmcs-nova-fixed-ptr [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) [12:44:09] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) (owner: 10Arturo Borrero Gonzalez) [12:44:24] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) (owner: 10Arturo Borrero Gonzalez) [12:47:47] (03PS1) 10Slyngshede: Permission UI: Minor tweaks to permission approval UI. [software/bitu] - 10https://gerrit.wikimedia.org/r/1088286 [12:48:55] (03PS14) 10Arturo Borrero Gonzalez: openstack: designate: deploy and enable wmcs-nova-fixed-ptr [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) [12:49:19] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) (owner: 10Arturo Borrero Gonzalez) [12:53:00] !log joal@deploy2002 Finished deploy [analytics/refinery@4bec064]: Regular analytics weekly train [analytics/refinery@4bec0640] (duration: 16m 47s) [12:55:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087927 (https://phabricator.wikimedia.org/T366381) (owner: 10Esanders) [12:55:59] (03PS15) 10Arturo Borrero Gonzalez: openstack: designate: deploy and enable wmcs-nova-fixed-ptr [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) [12:56:29] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) (owner: 10Arturo 
Borrero Gonzalez) [12:56:45] (03PS4) 10Slyngshede: P:netbox remove CAS authentication leftovers. [puppet] - 10https://gerrit.wikimedia.org/r/1087931 (https://phabricator.wikimedia.org/T371892) [12:59:28] (03PS16) 10Arturo Borrero Gonzalez: openstack: designate: deploy and enable wmcs-nova-fixed-ptr [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) [12:59:28] (03CR) 10CI reject: [V:04-1] P:netbox remove CAS authentication leftovers. [puppet] - 10https://gerrit.wikimedia.org/r/1087931 (https://phabricator.wikimedia.org/T371892) (owner: 10Slyngshede) [12:59:32] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) (owner: 10Arturo Borrero Gonzalez) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241107T1300) [13:01:09] (03PS5) 10Slyngshede: P:netbox remove CAS authentication leftovers. [puppet] - 10https://gerrit.wikimedia.org/r/1087931 (https://phabricator.wikimedia.org/T371892) [13:03:04] 06SRE, 07SRE-Unowned, 06serviceops-radar, 10wikitech.wikimedia.org: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10299374 (10fnegri) While this is being discussed, who is taking care of maintaining the current installation of wikitech-static? This alert has been firing for more tha... [13:03:41] (03CR) 10David Caro: [V:03+1 C:03+1] "Let me know when you want to deploy this and I'll merge it for you." 
[puppet] - 10https://gerrit.wikimedia.org/r/1087968 (https://phabricator.wikimedia.org/T360626) (owner: 10Raymond Ndibe) [13:08:36] !log joal@deploy2002 Started deploy [analytics/refinery@4bec064] (thin): Regular analytics weekly train THIN [analytics/refinery@4bec0640] [13:09:22] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4465/co" [puppet] - 10https://gerrit.wikimedia.org/r/1087931 (https://phabricator.wikimedia.org/T371892) (owner: 10Slyngshede) [13:10:48] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4466/co" [puppet] - 10https://gerrit.wikimedia.org/r/1087931 (https://phabricator.wikimedia.org/T371892) (owner: 10Slyngshede) [13:13:39] !log joal@deploy2002 Finished deploy [analytics/refinery@4bec064] (thin): Regular analytics weekly train THIN [analytics/refinery@4bec0640] (duration: 05m 03s) [13:20:03] !log joal@deploy2002 Started deploy [analytics/refinery@4bec064] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@4bec0640] [13:23:47] !log joal@deploy2002 Finished deploy [analytics/refinery@4bec064] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@4bec0640] (duration: 03m 44s) [13:25:31] (03CR) 10EoghanGaffney: [C:03+1] gerrit: add chown parameter to lfs data rsync, ensure daemon_user is used [puppet] - 10https://gerrit.wikimedia.org/r/1087967 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [13:28:10] (03PS17) 10Arturo Borrero Gonzalez: openstack: designate: deploy and enable wmcs-nova-fixed-ptr [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) [13:28:18] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) (owner: 10Arturo Borrero 
Gonzalez) [13:34:50] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1047 [13:35:31] (03PS18) 10Arturo Borrero Gonzalez: openstack: designate: deploy and enable wmcs-nova-fixed-ptr [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) [13:35:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1047 [13:36:50] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti1048 [13:37:33] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) (owner: 10Arturo Borrero Gonzalez) [13:37:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti1048 [13:40:27] (03CR) 10David Caro: [C:03+1] "LGTM thanks!" [alerts] - 10https://gerrit.wikimedia.org/r/1084782 (https://phabricator.wikimedia.org/T375479) (owner: 10FNegri) [13:41:13] (03CR) 10FNegri: [C:03+2] WMCS: split cloudvirt alerts from generic nodes [alerts] - 10https://gerrit.wikimedia.org/r/1084782 (https://phabricator.wikimedia.org/T375479) (owner: 10FNegri) [13:41:21] (03CR) 10FNegri: [C:03+2] alertmanager: simplify WMCS templates [puppet] - 10https://gerrit.wikimedia.org/r/1087531 (https://phabricator.wikimedia.org/T375479) (owner: 10FNegri) [13:42:47] (03Merged) 10jenkins-bot: WMCS: split cloudvirt alerts from generic nodes [alerts] - 10https://gerrit.wikimedia.org/r/1084782 (https://phabricator.wikimedia.org/T375479) (owner: 10FNegri) [13:47:07] (03PS19) 10Arturo Borrero Gonzalez: openstack: designate: deploy and enable wmcs-nova-fixed-ptr [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) [13:47:43] (03CR) 10CI reject: [V:04-1] openstack: designate: deploy and enable wmcs-nova-fixed-ptr [puppet] - 10https://gerrit.wikimedia.org/r/1087521 
(https://phabricator.wikimedia.org/T378192) (owner: 10Arturo Borrero Gonzalez) [13:48:35] (03PS20) 10Arturo Borrero Gonzalez: openstack: designate: deploy and enable wmcs-nova-fixed-ptr [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) [13:49:14] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) (owner: 10Arturo Borrero Gonzalez) [13:49:40] (03PS1) 10EoghanGaffney: apt-staging: Update signing key in distributions file to new key [puppet] - 10https://gerrit.wikimedia.org/r/1088292 [13:50:47] PROBLEM - MariaDB Replica SQL: s8 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1677, Errmsg: Column 0 of table wikidatawiki.revision cannot be converted from type int to type bigint(20) unsigned https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:50:51] PROBLEM - MariaDB Replica Lag: s8 on an-redacteddb1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 259174.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:52:19] !log running thanos bucket cleanup on titan1001 - T351927 [13:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:27] T351927: Decide and tweak Thanos retention - https://phabricator.wikimedia.org/T351927 [13:55:23] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1088286 (owner: 10Slyngshede) [13:56:10] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1088292 (owner: 10EoghanGaffney) [13:56:51] 06SRE, 10SRE-Access-Requests: Requesting access to deployment shell access for Kgraessle - https://phabricator.wikimedia.org/T379173#10299641 (10MatthewVernon) [13:59:53] (03CR) 10Muehlenhoff: [C:03+2] kafka::certificate: Remove non-PKI code paths [puppet] - 10https://gerrit.wikimedia.org/r/1088222 (https://phabricator.wikimedia.org/T337825) (owner: 10Muehlenhoff) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241107T1400). [14:00:05] kart_ and edsanders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:27] I'm here [14:00:28] I can probably deploy in a few minutes [14:01:34] I'm here. [14:01:46] Lucas_WMDE: I can self deploy my patch. 
[14:02:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088215 (https://phabricator.wikimedia.org/T371420) (owner: 10KartikMistry) [14:02:40] kart_: go ahead, I’m in a call after all now [14:03:09] !log joal@deploy2002 Started deploy [airflow-dags/analytics@23bc4ad]: Regular analytics weekly train [airflow-dags/analytics@23bc4ad3] [14:03:21] (03CR) 10EoghanGaffney: [C:03+2] apt-staging: Update signing key in distributions file to new key [puppet] - 10https://gerrit.wikimedia.org/r/1088292 (owner: 10EoghanGaffney) [14:03:26] (03Merged) 10jenkins-bot: Enable Section Translation in ann, iba, nr and, tdd Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088215 (https://phabricator.wikimedia.org/T371420) (owner: 10KartikMistry) [14:03:28] (03CR) 10Elukey: [C:03+2] sre.hosts.reimage: disable UEFI boot override after d-i [cookbooks] - 10https://gerrit.wikimedia.org/r/1088252 (owner: 10Elukey) [14:03:48] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1088215|Enable Section Translation in ann, iba, nr and, tdd Wikipedias (T371420)]] [14:04:03] T371420: Complete enablement Section Translation in new wikis and make the process less manual for the future - https://phabricator.wikimedia.org/T371420 [14:04:53] !log joal@deploy2002 Finished deploy [airflow-dags/analytics@23bc4ad]: Regular analytics weekly train [airflow-dags/analytics@23bc4ad3] (duration: 01m 44s) [14:04:58] (03CR) 10Ssingh: [C:03+1] hiera: moving haproxykafka common keys to profile [puppet] - 10https://gerrit.wikimedia.org/r/1087943 (https://phabricator.wikimedia.org/T377931) (owner: 10Fabfur) [14:05:43] (03CR) 10Elukey: [V:03+1 C:03+2] tlsproxy::localssl: allow multiple listens for tls ports [puppet] - 10https://gerrit.wikimedia.org/r/1087421 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [14:06:25] !log kartik@deploy2002 kartik: Backport for 
[[gerrit:1088215|Enable Section Translation in ann, iba, nr and, tdd Wikipedias (T371420)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:08:34] 10ops-codfw, 06DC-Ops: Port with no description on access switch - https://phabricator.wikimedia.org/T379256 (10phaultfinder) 03NEW [14:08:58] Lucas_WMDE: Can you do our patch after? [14:08:59] (03PS1) 10Muehlenhoff: cache::kafka::certificate: Remove $use_internal_ca [puppet] - 10https://gerrit.wikimedia.org/r/1088296 (https://phabricator.wikimedia.org/T337825) [14:09:20] !log kartik@deploy2002 kartik: Continuing with sync [14:09:44] edsanders: sure, I can deploy now :) [14:09:46] (once kart_ is done) [14:10:22] (03CR) 10Elukey: [C:03+1] Blacklist nilfs2 [puppet] - 10https://gerrit.wikimedia.org/r/1088273 (owner: 10Muehlenhoff) [14:11:45] (03CR) 10Jforrester: Enable Chart extension on testwiki and testcommonswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087987 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [14:12:35] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1088296 (https://phabricator.wikimedia.org/T337825) (owner: 10Muehlenhoff) [14:12:43] (03CR) 10Jforrester: "Testcommons is a closed wiki (unless you're reversing that)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087975 (https://phabricator.wikimedia.org/T379199) (owner: 10Bvibber) [14:13:57] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1088215|Enable Section Translation in ann, iba, nr and, tdd Wikipedias (T371420)]] (duration: 10m 08s) [14:14:05] and I'm done! [14:14:06] T371420: Complete enablement Section Translation in new wikis and make the process less manual for the future - https://phabricator.wikimedia.org/T371420 [14:14:09] Lucas_WMDE: ^ [14:14:11] thanks! 
[14:14:35] (03PS1) 10Vgutierrez: liberica: Harden hcforwarder systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1088297 (https://phabricator.wikimedia.org/T378341) [14:14:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087927 (https://phabricator.wikimedia.org/T366381) (owner: 10Esanders) [14:15:03] (03CR) 10Btullis: [C:03+2] refinery: gobblin: add webrequest_frontend. [puppet] - 10https://gerrit.wikimedia.org/r/1082434 (https://phabricator.wikimedia.org/T377931) (owner: 10Gmodena) [14:15:16] jouncebot: nowandnext [14:15:16] For the next 0 hour(s) and 44 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241107T1400) [14:15:16] In 1 hour(s) and 44 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241107T1600) [14:15:24] (03Merged) 10jenkins-bot: Deploy EditCheck (references) to hiwiki, bnwiki, idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087927 (https://phabricator.wikimedia.org/T366381) (owner: 10Esanders) [14:15:43] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1087927|Deploy EditCheck (references) to hiwiki, bnwiki, idwiki (T366381)]] [14:15:46] T366381: Make Edit Check (references) available to all newcomers at phase 2 Wikipedias - https://phabricator.wikimedia.org/T366381 [14:15:54] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:16:03] (03CR) 10Muehlenhoff: [C:03+2] Blacklist nilfs2 [puppet] - 10https://gerrit.wikimedia.org/r/1088273 (owner: 10Muehlenhoff) [14:17:12] (03CR) 10Fabfur: [C:03+2] hiera: moving haproxykafka common keys to profile [puppet] - 10https://gerrit.wikimedia.org/r/1087943 (https://phabricator.wikimedia.org/T377931) (owner: 10Fabfur) [14:18:11] !log lucaswerkmeister-wmde@deploy2002 esanders, lucaswerkmeister-wmde: Backport for [[gerrit:1087927|Deploy 
EditCheck (references) to hiwiki, bnwiki, idwiki (T366381)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:18:21] edsanders: can you test the change? [14:18:29] testing [14:20:01] (03CR) 10Ssingh: [C:03+1] "Looks good; what does the score look like with this applied?" [puppet] - 10https://gerrit.wikimedia.org/r/1088297 (https://phabricator.wikimedia.org/T378341) (owner: 10Vgutierrez) [14:20:39] Lucas_WMDE: lgtm [14:20:42] !log lucaswerkmeister-wmde@deploy2002 esanders, lucaswerkmeister-wmde: Continuing with sync [14:20:45] ok, thanks! [14:24:17] PROBLEM - BGP status on lsw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:25:10] (03CR) 10Ottomata: [C:03+1] dse-k8s-services: mw-dump: version bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088275 (https://phabricator.wikimedia.org/T368746) (owner: 10Gmodena) [14:25:20] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1087927|Deploy EditCheck (references) to hiwiki, bnwiki, idwiki (T366381)]] (duration: 09m 37s) [14:25:23] T366381: Make Edit Check (references) available to all newcomers at phase 2 Wikipedias - https://phabricator.wikimedia.org/T366381 [14:25:57] Dreamy_Jazz: I’m done for now if you want to deploy something (though in 35 minutes I want to do some deployments with joelyrookewmde) [14:27:25] FIRING: SystemdUnitFailed: user@0.service on elastic1084:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:27:43] (03CR) 10Aude: Enable Chart extension on testwiki and testcommonswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087987 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [14:29:17] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for 
host ms-be2082.codfw.wmnet with OS bullseye [14:30:34] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10299735 (10Papaul) [14:30:39] (03CR) 10Vgutierrez: "→ Overall exposure level for liberica-hcforwarder.service: 5.0 MEDIUM 😐" [puppet] - 10https://gerrit.wikimedia.org/r/1088297 (https://phabricator.wikimedia.org/T378341) (owner: 10Vgutierrez) [14:32:25] RESOLVED: SystemdUnitFailed: user@0.service on elastic1084:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:33:32] (03CR) 10Ssingh: [C:03+1] "Let's merge I guess and then we can re-assess; surely is better than no hardening so +1!" [puppet] - 10https://gerrit.wikimedia.org/r/1088297 (https://phabricator.wikimedia.org/T378341) (owner: 10Vgutierrez) [14:34:17] RECOVERY - BGP status on lsw1-e1-eqiad.mgmt is OK: BGP OK - up: 9, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:34:43] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10299748 (10Papaul) @ayounsi thanks for the information [14:34:56] 10SRE-tools, 06Infrastructure-Foundations: Add an ownership field to cookbooks. 
- https://phabricator.wikimedia.org/T379258 (10joanna_borun) 03NEW [14:35:48] (03CR) 10Vgutierrez: [C:03+2] liberica: Harden hcforwarder systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1088297 (https://phabricator.wikimedia.org/T378341) (owner: 10Vgutierrez) [14:38:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:51] (03PS1) 10Fabfur: haproxykafka: systemd service hardening [puppet] - 10https://gerrit.wikimedia.org/r/1088298 (https://phabricator.wikimedia.org/T379237) [14:39:35] 10SRE-tools, 06Infrastructure-Foundations: Outdated cookbooks cleanup - https://phabricator.wikimedia.org/T379259 (10joanna_borun) 03NEW [14:41:52] !log installing python-git security updates [14:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:20] (03CR) 10CDanis: Enable Chart extension on testwiki and testcommonswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087987 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [14:45:48] (03CR) 10Xcollazo: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1088275 (https://phabricator.wikimedia.org/T368746) (owner: 10Gmodena) [14:48:53] (03PS1) 10Cwhite: titan: set thanos 5m retention to 5w [puppet] - 10https://gerrit.wikimedia.org/r/1088301 (https://phabricator.wikimedia.org/T351927) [14:54:07] (03CR) 10MVernon: [C:03+1] "LGTM, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1088301 (https://phabricator.wikimedia.org/T351927) (owner: 10Cwhite) [14:54:21] (03CR) 10Cwhite: [C:03+2] titan: set thanos 5m retention to 5w [puppet] - 10https://gerrit.wikimedia.org/r/1088301 (https://phabricator.wikimedia.org/T351927) (owner: 10Cwhite) [14:55:17] jouncebot: nowandnext [14:55:17] For the next 0 hour(s) and 4 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241107T1400) [14:55:17] In 1 hour(s) and 4 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241107T1600) [14:55:56] !log Restarted CI Jenkins for plugins update [14:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:27] jouncebot: nowandnext [15:01:28] No deployments scheduled for the next 0 hour(s) and 58 minute(s) [15:01:28] In 0 hour(s) and 58 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241107T1600) [15:01:44] joelyrookewmde and I are going to do some deployment training, hopefully enabling Wikidata on one or more client wikis [15:01:52] if that’s alright with everyone else :) [15:03:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:14] !log herron@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-codfw [15:10:37] !log 
jnuche@deploy2002 Started deploy [releng/jenkins-deploy@abc27c0] (releasing): (no justification provided) [15:11:26] !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@abc27c0] (releasing): (no justification provided) (duration: 00m 52s) [15:14:27] !log jnuche@deploy2002 Started deploy [releng/jenkins-deploy@abc27c0] (releasing): (no justification provided) [15:15:38] !log jnuche@deploy2002 Finished deploy [releng/jenkins-deploy@abc27c0] (releasing): (no justification provided) (duration: 01m 13s) [15:16:51] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2082.codfw.wmnet with OS bullseye [15:18:25] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@6d0b97e]: Add new wikis to RESTBase [15:19:32] (03CR) 10Raymond Ndibe: "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1087968 (https://phabricator.wikimedia.org/T360626) (owner: 10Raymond Ndibe) [15:20:54] (03CR) 10Raymond Ndibe: "Yes I think we should get this merged, there is no reason to keep it hanging" [puppet] - 10https://gerrit.wikimedia.org/r/1087968 (https://phabricator.wikimedia.org/T360626) (owner: 10Raymond Ndibe) [15:22:21] (03CR) 10David Caro: [V:03+1 C:03+2] openstack: keystone: fix radosgw 500 errors with Object Storage [puppet] - 10https://gerrit.wikimedia.org/r/1087968 (https://phabricator.wikimedia.org/T360626) (owner: 10Raymond Ndibe) [15:23:44] does anyone know how to make an ad-hoc backup of a (small) database table by any chance? 
[15:24:06] we’re trying to follow https://wikitech.wikimedia.org/wiki/Add_a_wiki#Wikidata, which says “It's probably also wise to backup these tables from Wikidata and at least one Wikipedia” [15:24:11] but doesn’t explain how :/ [15:24:35] and https://wikitech.wikimedia.org/wiki/MariaDB/Backups doesn’t look too helpful, I think that’s for “regular” backups [15:25:11] (03PS21) 10Arturo Borrero Gonzalez: openstack: designate: deploy wmcs_nova_fixed_ptr [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) [15:26:04] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) (owner: 10Arturo Borrero Gonzalez) [15:26:12] maybe Amir1 is around? [15:26:34] jynus: ^ ? [15:29:16] Don't worry, it is my job to have backups, you can continue. I guess it would be nice to have a script/automation that does that, but if you break it I will be able to fix it. But please try to not break it. 
[15:29:23] !log herron@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-codfw [15:29:31] okay, thanks :'D [15:30:15] or, you know, correct the script so it is smarter [15:30:28] well, for that I’d need to know how it’s broken first [15:30:29] but we get you covered [15:30:39] I just have a comment on wikitech saying it’s been known to cause problems [15:30:41] but I’ve never run it myself [15:30:50] I cannot help with that, sorry [15:31:16] !log joelyrookewmde@mwmaint2002:~$ foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https [15:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:55] !log taavi@deploy2002 ~ $ mwscript-k8s migrateUserGroup.php -- --wiki=labswiki contentadmin sysop # T375950 [15:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:59] T375950: fold contentadmin group to sysop in Wikitech - https://phabricator.wikimedia.org/T375950 [15:33:11] !log herron@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-eqiad [15:34:17] (03PS1) 10Majavah: wikitech: Drop contentadmin group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088318 (https://phabricator.wikimedia.org/T375950) [15:37:49] (03PS1) 10Elukey: profile::maps::tlsproxy: allow traffic to port 6543 [puppet] - 10https://gerrit.wikimedia.org/r/1088319 (https://phabricator.wikimedia.org/T378944) [15:39:58] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@6d0b97e]: Add new wikis to RESTBase (duration: 21m 33s) [15:40:42] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4467/co" [puppet] - 10https://gerrit.wikimedia.org/r/1088319 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [15:47:44] (03PS1) 
10Muehlenhoff: Grant access to logstash to cn=logstash-access [puppet] - 10https://gerrit.wikimedia.org/r/1088322 (https://phabricator.wikimedia.org/T376790) [15:47:45] (03PS1) 10Muehlenhoff: Remove legacy logstash IDP service [puppet] - 10https://gerrit.wikimedia.org/r/1088323 [15:48:22] jouncebot: next [15:48:23] In 0 hour(s) and 11 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241107T1600) [15:48:49] (03CR) 10Herron: [C:03+1] Grant access to logstash to cn=logstash-access [puppet] - 10https://gerrit.wikimedia.org/r/1088322 (https://phabricator.wikimedia.org/T376790) (owner: 10Muehlenhoff) [15:51:50] 06SRE, 10SRE-Access-Requests: Requesting access to deployment shell access for Kgraessle - https://phabricator.wikimedia.org/T379173#10300219 (10thcipriani) > Reason for access: Access to stat machines in production for query load testing Does this mean machines like [[https://wikitech.wikimedia.org/wiki/Data... [15:52:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1010.eqiad.wmnet [15:52:33] (03PS1) 10Elukey: sre.hosts.reimage: apply more overrides after d-i for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1088324 [15:53:22] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2082.codfw.wmnet with OS bullseye [15:53:27] !log Finished populateSitesTable for tcywiktionary (T378466) and tcywikisource (T378474) [15:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:48] T378466: Add Wikidata support for tcywiktionary - https://phabricator.wikimedia.org/T378466 [15:53:48] T378474: Add Wikidata support for tcywikisource - https://phabricator.wikimedia.org/T378474 [15:53:56] (03CR) 10Elukey: [C:03+1] cache::kafka::certificate: Remove $use_internal_ca [puppet] - 10https://gerrit.wikimedia.org/r/1088296 (https://phabricator.wikimedia.org/T337825) (owner: 10Muehlenhoff) [15:54:15] (03CR) 10Elukey: 
[V:03+1 C:03+2] profile::maps::tlsproxy: allow traffic to port 6543 [puppet] - 10https://gerrit.wikimedia.org/r/1088319 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [15:54:16] (03PS2) 10Muehlenhoff: Grant access to logstash to cn=logstash-access [puppet] - 10https://gerrit.wikimedia.org/r/1088322 (https://phabricator.wikimedia.org/T376790) [15:54:53] !log remove ganeti1010 from active ganeti nodes T378921 [15:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:56] T378921: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921 [15:56:07] (03CR) 10Muehlenhoff: [C:03+2] Grant access to logstash to cn=logstash-access [puppet] - 10https://gerrit.wikimedia.org/r/1088322 (https://phabricator.wikimedia.org/T376790) (owner: 10Muehlenhoff) [15:56:23] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10300261 (10MoritzMuehlenhoff) [15:57:27] PROBLEM - ganeti-confd running on ganeti1010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [15:57:47] PROBLEM - ganeti-noded running on ganeti1010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [15:57:48] !log herron@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-eqiad [15:57:52] (03PS6) 10Elukey: Create new lvs service kartotherian-k8s-ssl [puppet] - 10https://gerrit.wikimedia.org/r/1087422 (https://phabricator.wikimedia.org/T378944) [15:58:10] (03CR) 10TChin: Add airflow connection conf for datahub (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1085449 (https://phabricator.wikimedia.org/T306896) (owner: 10Ottomata) [15:58:31] jouncebot: nowandnext [15:58:32] No 
deployments scheduled for the next 0 hour(s) and 1 minute(s) [15:58:32] In 0 hour(s) and 1 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241107T1600) [15:58:44] (Just checking calendar) [15:59:02] (03PS1) 10Muehlenhoff: Remove ganeti role from to-be-decommed servers [puppet] - 10https://gerrit.wikimedia.org/r/1088325 (https://phabricator.wikimedia.org/T378921) [15:59:07] FIRING: ProbeDown: Service ganeti1010:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:00:05] jnuche and dduvall: It is that lovely time of the day again! You are hereby commanded to deploy Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241107T1600). [16:00:33] (03PS1) 10EoghanGaffney: apt-staging: Add flag to import apt packages to repository in the cron job [puppet] - 10https://gerrit.wikimedia.org/r/1088326 [16:01:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install mc-gp100[4-6] - https://phabricator.wikimedia.org/T377032#10300292 (10Jclark-ctr) [16:02:15] (03CR) 10Giuseppe Lavagetto: [C:03+1] apt-staging: Add flag to import apt packages to repository in the cron job [puppet] - 10https://gerrit.wikimedia.org/r/1088326 (owner: 10EoghanGaffney) [16:02:21] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088327 [16:03:31] (03CR) 10CDanis: external_clouds_vendors: Use requestctl apply (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088274 (owner: 10Clément Goubert) [16:04:18] !log elukey@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage [16:04:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install mc-gp100[4-6] - 
https://phabricator.wikimedia.org/T377032#10300310 (10Jclark-ctr) 05Open→03Resolved a:05jijiki→03Jclark-ctr @jijiki Hey I received these last night and was able to wrapped this up installs same day but waited til... [16:05:50] (03CR) 10Clément Goubert: external_clouds_vendors: Use requestctl apply (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088274 (owner: 10Clément Goubert) [16:06:48] (03CR) 10CDanis: external_clouds_vendors: Use requestctl apply (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088274 (owner: 10Clément Goubert) [16:07:01] (03PS2) 10Elukey: sre.hosts.reimage: apply more overrides after d-i for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1088324 [16:07:02] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:07:28] * Lucas_WMDE and joelyrookewmde are done deploying btw :) [16:07:35] the maintenance script showed no errors after all \o/ [16:07:48] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage [16:09:55] (03PS3) 10Eevans: Update corto puppetization [puppet] - 10https://gerrit.wikimedia.org/r/1087980 (https://phabricator.wikimedia.org/T379204) [16:10:14] 06SRE, 10conftool, 06Data-Persistence, 06Infrastructure-Foundations: Integrate dbctl IP changes as part of VLAN changes. - https://phabricator.wikimedia.org/T360029#10300359 (10Volans) I think we could put the code directly into the move-vlan cookbook, if the host is present in dbctl, update it. I don't se... 
[16:12:11] (03PS1) 10Jforrester: RecentChangesTranslationFilterHookHandler: Replace call to deprecated ChangeTags::getDisplayTableName() [extensions/Translate] (wmf/1.44.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1088329 (https://phabricator.wikimedia.org/T379150) [16:12:48] 06SRE, 10conftool, 06Data-Persistence, 06Infrastructure-Foundations: Integrate dbctl IP changes as part of VLAN changes. - https://phabricator.wikimedia.org/T360029#10300366 (10Marostegui) Thanks! No, I don't really have any strong opinions on either way, it was mostly to know your thoughts on this. I will... [16:14:00] (03PS3) 10Fabfur: hiera: do not install haproxykafka on cloud instances [puppet] - 10https://gerrit.wikimedia.org/r/1088244 (https://phabricator.wikimedia.org/T370668) [16:15:33] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2083.codfw.wmnet with OS bullseye [16:16:02] (03PS6) 10Brouberol: airflow: render the spark/hadoop/hdfs/yarn configuration files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087903 (https://phabricator.wikimedia.org/T377928) [16:17:10] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1088244 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [16:18:22] (03CR) 10Majavah: snapshot: Remove labtestwiki from excluded wikis (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1087609 (https://phabricator.wikimedia.org/T378260) (owner: 10Zabe) [16:19:13] (03CR) 10Elukey: "@vgutierrez@wikimedia.org lemme know if the change looks good, in case I'll proceed with the new endpoint :)" [puppet] - 10https://gerrit.wikimedia.org/r/1087422 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [16:19:33] (03PS7) 10Brouberol: airflow: render the spark/hadoop/hdfs/yarn configuration files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087903 (https://phabricator.wikimedia.org/T377928) [16:19:47] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: designate: deploy 
wmcs_nova_fixed_ptr [puppet] - 10https://gerrit.wikimedia.org/r/1087521 (https://phabricator.wikimedia.org/T378192) (owner: 10Arturo Borrero Gonzalez) [16:20:41] (03PS1) 10Jforrester: zotero: Switch image from gerrit- to GitLab-hosted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088330 (https://phabricator.wikimedia.org/T374558) [16:21:44] (03CR) 10Arlolra: [C:03+2] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088327 (owner: 10PipelineBot) [16:22:02] RESOLVED: ProbeDown: Service ganeti1010:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:22:37] 10ops-codfw, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup2012 - https://phabricator.wikimedia.org/T371984#10300376 (10Jhancock.wm) 05Open→03Resolved @jcrespo We got the BPU installed on this server.It's showing in the BIOS (attached). You can procee... 
[16:22:55] (03Merged) 10jenkins-bot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1088327 (owner: 10PipelineBot) [16:23:36] !log arlolra@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [16:23:53] (03CR) 10Vgutierrez: [V:03+1] "hmm this could trigger some alerts given you're setting the "new" service directly as `state: production`, otherwise it looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1087422 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [16:24:09] !log arlolra@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [16:25:19] (03CR) 10Vgutierrez: [V:03+1 C:03+1] Create new lvs service kartotherian-k8s-ssl [puppet] - 10https://gerrit.wikimedia.org/r/1087422 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [16:25:55] FIRING: MaxConntrack: Max conntrack at 94.34% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [16:26:49] PROBLEM - Check size of conntrack table on krb1001 is CRITICAL: CRITICAL: nf_conntrack is 94 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [16:27:49] RECOVERY - Check size of conntrack table on krb1001 is OK: OK: nf_conntrack is 54 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [16:28:06] (03CR) 10Elukey: "You are totally right, maybe better lvs_setup ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1087422 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [16:28:17] !log elukey@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2083.codfw.wmnet with reason: host reimage [16:28:34] !log elukey@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin2002" [16:30:55] RESOLVED: MaxConntrack: Max conntrack at 92.47% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [16:31:46] (03CR) 10Scott French: "Thanks, Timo!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087604 (https://phabricator.wikimedia.org/T372603) (owner: 10Scott French) [16:32:04] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2083.codfw.wmnet with reason: host reimage [16:33:48] (03PS7) 10Elukey: Create new lvs service kartotherian-k8s-ssl [puppet] - 10https://gerrit.wikimedia.org/r/1087422 (https://phabricator.wikimedia.org/T378944) [16:34:08] (03CR) 10Elukey: "Applied :)" [puppet] - 10https://gerrit.wikimedia.org/r/1087422 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [16:35:00] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host ms-be2084.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:38:54] PROBLEM - Check size of conntrack table on krb1001 is CRITICAL: CRITICAL: nf_conntrack is 91 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [16:38:56] FIRING: MaxConntrack: Max conntrack at 90.59% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [16:41:34] !log fnegri@cumin1002 START - Cookbook 
sre.hosts.reimage for host cloudvirt1063.eqiad.wmnet with OS bookworm [16:43:24] RECOVERY - SSH on ganeti2042 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:43:26] RECOVERY - Host ganeti2042 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms [16:43:40] (03CR) 10Vgutierrez: [C:03+1] Create new lvs service kartotherian-k8s-ssl [puppet] - 10https://gerrit.wikimedia.org/r/1087422 (https://phabricator.wikimedia.org/T378944) (owner: 10Elukey) [16:43:56] RESOLVED: MaxConntrack: Max conntrack at 90.5% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [16:44:39] (03PS1) 10Cathal Mooney: Do not configure option 82 insertion for frack switches [homer/public] - 10https://gerrit.wikimedia.org/r/1088337 (https://phabricator.wikimedia.org/T268802) [16:44:55] (03PS1) 10Majavah: dynamicproxy: Provision AAAA records [puppet] - 10https://gerrit.wikimedia.org/r/1088338 (https://phabricator.wikimedia.org/T379175) [16:44:57] (03PS1) 10Majavah: dynamicproxy: Canocalize IP addresses before comparing [puppet] - 10https://gerrit.wikimedia.org/r/1088339 (https://phabricator.wikimedia.org/T379175) [16:45:24] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2084.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [16:45:47] (03CR) 10CI reject: [V:04-1] dynamicproxy: Canocalize IP addresses before comparing [puppet] - 10https://gerrit.wikimedia.org/r/1088339 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [16:45:53] (03PS3) 10Clément Goubert: external_clouds_vendors: Use requestctl apply [puppet] - 10https://gerrit.wikimedia.org/r/1088274 [16:46:41] (03PS2) 10Majavah: dynamicproxy: Canocalize IP addresses before comparing [puppet] - 10https://gerrit.wikimedia.org/r/1088339 
(https://phabricator.wikimedia.org/T379175) [16:46:48] (03CR) 10CDanis: [C:03+1] "thanks <3" [puppet] - 10https://gerrit.wikimedia.org/r/1088274 (owner: 10Clément Goubert) [16:46:53] !log elukey@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2084.codfw.wmnet with OS bullseye [16:46:58] (03CR) 10Clément Goubert: external_clouds_vendors: Use requestctl apply (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088274 (owner: 10Clément Goubert) [16:48:53] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host fransc1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:48:54] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host fransc1001.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:49:34] (03CR) 10FNegri: "LGTM, but I'll let Arturo +1 as he knows much more about the IPv6 plans." [puppet] - 10https://gerrit.wikimedia.org/r/1087951 (owner: 10Majavah) [16:49:54] RECOVERY - Check size of conntrack table on krb1001 is OK: OK: nf_conntrack is 57 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [16:50:07] 10SRE-tools, 06Infrastructure-Foundations: Outdated cookbooks cleanup - https://phabricator.wikimedia.org/T379259#10300550 (10Volans) p:05Triage→03Medium a:03Volans [16:50:22] 10SRE-tools, 06Infrastructure-Foundations: Add an ownership field to cookbooks. 
- https://phabricator.wikimedia.org/T379258#10300548 (10Volans) p:05Triage→03Medium a:03Volans [16:52:02] (03CR) 10Btullis: Add airflow connection conf for datahub (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1085449 (https://phabricator.wikimedia.org/T306896) (owner: 10Ottomata) [16:53:45] 06SRE-OnFire, 10Incident Tooling, 13Patch-For-Review: corto: update production deployment for project changes - https://phabricator.wikimedia.org/T379204#10300566 (10Eevans) p:05Triage→03Medium [16:54:21] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2083.codfw.wmnet with OS bullseye [16:54:31] !log arlolra@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [16:56:00] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1063.eqiad.wmnet with reason: host reimage [16:56:23] (03CR) 10Btullis: Add airflow connection conf for datahub (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1085449 (https://phabricator.wikimedia.org/T306896) (owner: 10Ottomata) [16:56:32] (03CR) 10EoghanGaffney: [C:03+2] apt-staging: Add flag to import apt packages to repository in the cron job [puppet] - 10https://gerrit.wikimedia.org/r/1088326 (owner: 10EoghanGaffney) [16:56:42] !log arlolra@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [16:56:50] !log arlolra@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [16:57:10] !log arlolra@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [16:57:14] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2084.codfw.wmnet with OS bullseye [16:57:36] !log elukey@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - elukey@cumin2002" [16:57:37] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
ms-be2082.codfw.wmnet with OS bullseye [16:58:52] (03CR) 10RLazarus: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4470/co" [puppet] - 10https://gerrit.wikimedia.org/r/1085620 (https://phabricator.wikimedia.org/T375508) (owner: 10Dreamy Jazz) [16:58:56] (03CR) 10Tchanders: "Added a notice for users with local rights only: Ibfa768431c9072e7201070ce29fe18ed6cf4a086" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087584 (https://phabricator.wikimedia.org/T356294) (owner: 10Tchanders) [17:00:04] jhathaway and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241107T1700). [17:00:05] Dreamy_Jazz: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:07] \o [17:00:11] o/ [17:00:23] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): Q2:rack/setup/install wdqs202[67] - https://phabricator.wikimedia.org/T378031#10300614 (10Papaul) @bking can you please provide us with the row location where you need those hosts racked? Thanks [17:00:36] Dreamy_Jazz: the change LGTM, PCC is happy, I'll go ahead and merge -- will you want to kick off a test run of the jobs, or wait for the first scheduled one? [17:01:01] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1063.eqiad.wmnet with reason: host reimage [17:01:31] I don't mind waiting for the first scheduled run, but if we could start it early I could start adding the metrics now. So if you have the time, yes please. 
[17:01:45] (03CR) 10RLazarus: [V:03+1 C:03+2] Schedule daily runs of WikimediaEvents UpdatePeriodicMetrics.php [puppet] - 10https://gerrit.wikimedia.org/r/1085620 (https://phabricator.wikimedia.org/T375508) (owner: 10Dreamy Jazz) [17:02:11] sure, happy to [17:02:23] I assume you want wikimediaevents-UpdatePeriodicMetrics-per-wiki before wikimediaevents-UpdatePeriodicMetrics-global [17:02:31] !log arlolra@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [17:03:07] It should not matter for the order in which they are run [17:03:12] cool [17:03:25] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:03:34] !log arlolra@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [17:03:45] merged, forcing a puppet run on mwmaint2002 now [17:06:15] (03PS5) 10Aude: Enable Chart extension on testwiki and testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087987 (https://phabricator.wikimedia.org/T378127) [17:06:19] !log manually run mediawiki_job_wikimediaevents-UpdatePeriodicMetrics-per-wiki # T375508 [17:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:32] T375508: Temp accounts Grafana Dashboard: Total & Active IP Reveal users - https://phabricator.wikimedia.org/T375508 [17:08:25] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:08:25] !log arlolra@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [17:09:07] To clarify, it should not also cause any issues to run the global metrics at the same time as the 
local ones [17:09:10] !log arlolra@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [17:09:12] (03PS6) 10Aude: Enable Chart extension on testwiki and testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087987 (https://phabricator.wikimedia.org/T378127) [17:09:18] 10ops-codfw, 06SRE, 06DC-Ops: Port with no description on access switch - https://phabricator.wikimedia.org/T379256#10300640 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [17:09:18] 10ops-codfw, 06SRE, 06DC-Ops: Port with no description on access switch - https://phabricator.wikimedia.org/T379256#10300639 (10Jhancock.wm) this was my fault. alert has cleared. [17:09:21] (03CR) 10Aude: Enable Chart extension on testwiki and testcommonswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087987 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [17:12:09] !log manually run mediawiki_job_wikimediaevents-UpdatePeriodicMetrics-global # T375508 [17:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:12] T375508: Temp accounts Grafana Dashboard: Total & Active IP Reveal users - https://phabricator.wikimedia.org/T375508 [17:12:18] Dreamy_Jazz: ha sorry, didn't see that until per-wiki was finished anyway :0 [17:12:20] *:) [17:12:43] both done, check logs at your convenience and let me know if you need anything [17:12:48] Sure. Thanks. [17:14:36] rzl: Can you remind me how to access these logs? 
[17:16:02] !log cmooney@cumin1002 START - Cookbook sre.network.tls for network device fasw2-c1a-eqiad [17:16:07] (03PS1) 10Majavah: P:wmcs::cloud_private_subnet: Add IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1088341 (https://phabricator.wikimedia.org/T379283) [17:16:13] Dreamy_Jazz: on mwmaint2002, try `journalctl -u mediawiki_job_wikimediaevents-UpdatePeriodicMetrics-global` [17:16:44] (03CR) 10CI reject: [V:04-1] P:wmcs::cloud_private_subnet: Add IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1088341 (https://phabricator.wikimedia.org/T379283) (owner: 10Majavah) [17:17:20] Hmm. Says I do not have the permissions for that. [17:18:17] !log cmooney@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device fasw2-c1a-eqiad [17:19:00] PROBLEM - Disk space on thanos-be1003 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdh1 174889 MB (4% inode=92%): /srv/swift-storage/sdc1 181618 MB (4% inode=91%): /srv/swift-storage/sdf1 228762 MB (5% inode=91%): /srv/swift-storage/sdg1 201900 MB (5% inode=91%): /srv/swift-storage/sdd1 194669 MB (5% inode=91%): /srv/swift-storage/sde1 190699 MB (5% inode=92%): /srv/swift-storage/sdi1 180593 MB (4% inode=91%): /srv/swift-st [17:19:00] k1 176626 MB (4% inode=92%): /srv/swift-storage/sdj1 188889 MB (4% inode=91%): /srv/swift-storage/sdl1 179330 MB (4% inode=91%): /srv/swift-storage/sdm1 187187 MB (4% inode=91%): /srv/swift-storage/sdn1 152024 MB (3% inode=90%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops [17:19:18] one sec, I know deployers have the rights to do this but I don't remember how :) [17:21:08] while I'm looking for something authoritative, try it with sudo just to see if it's covered by your sudoer rules [17:21:29] (03PS2) 10Majavah: P:wmcs::cloud_private_subnet: Add IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1088341 
(https://phabricator.wikimedia.org/T379283) [17:22:17] Hmm. It's asking me for a password to use it. [17:22:22] Any idea what password it uses? [17:22:24] okay never mind [17:22:41] Dreamy_Jazz: the password prompt basically means "you can't do this" [17:23:32] (03CR) 10CI reject: [V:04-1] P:wmcs::cloud_private_subnet: Add IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1088341 (https://phabricator.wikimedia.org/T379283) (owner: 10Majavah) [17:27:27] Dreamy_Jazz: okay I'll figure out a better answer for next time, at the moment if it's okay with you I can just drop a couple of log files in your homedir to unblock you [17:27:29] !log fnegri@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fnegri@cumin1002" [17:27:39] Yeah. That's fine with me [17:28:03] Seeing the logs isn't super important, but just want to double check what metrics have been collected by this run. [17:29:13] Dreamy_Jazz: ~dreamyjazz/global.txt and ~dreamyjazz/per-wiki.txt on mwmaint2002 [17:29:17] Thanks! [17:29:24] !log fnegri@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fnegri@cumin1002" [17:29:24] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1063.eqiad.wmnet with OS bookworm [17:34:20] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10300817 (10cmooney) @Jclark-ctr as discussed I believe we should have a load of copper SFPs from T369557. We need one of thes... 
[17:36:03] (03CR) 10Ayounsi: [C:03+1] Do not configure option 82 insertion for frack switches [homer/public] - 10https://gerrit.wikimedia.org/r/1088337 (https://phabricator.wikimedia.org/T268802) (owner: 10Cathal Mooney) [17:37:57] (03CR) 10Scott French: [C:03+2] changeprop: add per-rule consumer properties in jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087542 (https://phabricator.wikimedia.org/T356241) (owner: 10Scott French) [17:39:05] (03Merged) 10jenkins-bot: changeprop: add per-rule consumer properties in jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087542 (https://phabricator.wikimedia.org/T356241) (owner: 10Scott French) [17:41:46] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply [17:42:25] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [17:43:53] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [17:44:30] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [17:48:05] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: apply [17:48:34] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [17:51:00] PROBLEM - BGP status on pfw1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal, AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:51:58] RECOVERY - ensure kvm processes are running on cloudvirt1063 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:54:23] (03CR) 10Scott French: [C:03+2] changeprop-jobqueue: update to 2024-11-05-170900-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087557 (https://phabricator.wikimedia.org/T356241) (owner: 10Scott French) [17:55:20] !log fnegri@cumin1002 START - Cookbook sre.hosts.remove-downtime for 
cloudvirt1063.eqiad.wmnet [17:55:21] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cloudvirt1063.eqiad.wmnet [17:55:24] (03Merged) 10jenkins-bot: changeprop-jobqueue: update to 2024-11-05-170900-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1087557 (https://phabricator.wikimedia.org/T356241) (owner: 10Scott French) [17:56:58] PROBLEM - ensure kvm processes are running on cloudvirt1063 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:57:30] !log swfrench@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [17:58:01] !log swfrench@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [17:59:00] RECOVERY - BGP status on pfw1-codfw is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:59:52] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [18:00:05] bd808: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241107T1800). [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241107T1800) [18:00:21] Nothing to do in my window today. [18:00:47] 10ops-eqiad, 06SRE, 06Data-Persistence-SRE, 06DBA, 06DC-Ops: db1246 crashed, doesn't reboot cleanly - https://phabricator.wikimedia.org/T374215#10300939 (10wiki_willy) Thanks for the details @Marostegui, I've escalated this up to our Account Rep, to see if she can help push things along. 
[18:00:58] RECOVERY - ensure kvm processes are running on cloudvirt1063 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:01:31] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [18:02:36] !/win 28 [18:04:00] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): Q2:rack/setup/install wdqs202[67] - https://phabricator.wikimedia.org/T378031#10300982 (10bking) [18:04:57] (03CR) 10Ottomata: [V:03+1] Add airflow connection conf for datahub (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1085449 (https://phabricator.wikimedia.org/T306896) (owner: 10Ottomata) [18:05:08] (03CR) 10Dzahn: [V:03+1 C:03+2] gerrit: add chown parameter to lfs data rsync, ensure daemon_user is used [puppet] - 10https://gerrit.wikimedia.org/r/1087967 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [18:05:55] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): Q2:rack/setup/install wdqs202[67] - https://phabricator.wikimedia.org/T378031#10300999 (10bking) @Papaul Sure, I updated racking proposal above with more details. Sorry for not providing this earlier. Repost... [18:10:01] (03CR) 10Dzahn: [V:03+1 C:03+2] "on gerrit2002: +/usr/bin/rsync -a --chown=gerrit2:gerrit2 .." 
[puppet] - 10https://gerrit.wikimedia.org/r/1087967 (https://phabricator.wikimedia.org/T338470) (owner: 10Dzahn) [18:11:46] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [18:13:25] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [18:14:31] !log updated changeprop-jobqueue to 2024-11-05-170900-production - T356241 [18:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:34] T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241 [18:21:28] PROBLEM - BFD status on cr2-magru is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:21:30] PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:21:32] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:21:35] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Frack eqiad network upgrade: design, installation and configuration - https://phabricator.wikimedia.org/T377381#10301105 (10cmooney) [18:22:12] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:30:27] (03PS3) 10Majavah: P:wmcs::cloud_private_subnet: Add IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1088341 (https://phabricator.wikimedia.org/T379283) [18:30:28] (03PS1) 10Majavah: interface::rule: Add missing -6 flag [puppet] - 10https://gerrit.wikimedia.org/r/1088357 [18:30:30] (03CR) 10Btullis: Add airflow connection conf for datahub (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1085449 (https://phabricator.wikimedia.org/T306896) (owner: 10Ottomata) [18:31:21] (03CR) 10CI reject: [V:04-1] interface::rule: Add missing -6 flag [puppet] - 
10https://gerrit.wikimedia.org/r/1088357 (owner: 10Majavah) [18:31:24] (03CR) 10CI reject: [V:04-1] P:wmcs::cloud_private_subnet: Add IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1088341 (https://phabricator.wikimedia.org/T379283) (owner: 10Majavah) [18:32:30] (03CR) 10Majavah: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1088357 (owner: 10Majavah) [18:36:12] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 12 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/1088341 (https://phabricator.wikimedia.org/T379283) (owner: 10Majavah) [18:36:22] (03PS4) 10Majavah: P:wmcs::cloud_private_subnet: Add IPv6 support [puppet] - 10https://gerrit.wikimedia.org/r/1088341 (https://phabricator.wikimedia.org/T379283) [18:37:13] (03PS1) 10Herron: site: add insetup roles to aux-k8s-worker100[45] [puppet] - 10https://gerrit.wikimedia.org/r/1088359 (https://phabricator.wikimedia.org/T378989) [18:38:24] (03PS2) 10Lucas Werkmeister (WMDE): tables-catalog: Add GlobalUsage (globalimagelinks) [puppet] - 10https://gerrit.wikimedia.org/r/1087867 (https://phabricator.wikimedia.org/T363581) [18:38:27] (03CR) 10Ladsgroup: [C:03+2] tables-catalog: Add GlobalUsage (globalimagelinks) [puppet] - 10https://gerrit.wikimedia.org/r/1087867 (https://phabricator.wikimedia.org/T363581) (owner: 10Lucas Werkmeister (WMDE)) [18:38:29] (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables-catalog: Add GlobalUsage (globalimagelinks) [puppet] - 10https://gerrit.wikimedia.org/r/1087867 (https://phabricator.wikimedia.org/T363581) (owner: 10Lucas Werkmeister (WMDE)) [18:39:15] (03PS4) 10Abijeet Patro: tables-catalog: Add translate_cache table [puppet] - 10https://gerrit.wikimedia.org/r/1082546 (https://phabricator.wikimedia.org/T370265) [18:39:27] (03CR) 10Ladsgroup: [C:03+2] tables-catalog: Add translate_cache table [puppet] - 10https://gerrit.wikimedia.org/r/1082546 
(https://phabricator.wikimedia.org/T370265) (owner: 10Abijeet Patro) [18:39:31] (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables-catalog: Add translate_cache table [puppet] - 10https://gerrit.wikimedia.org/r/1082546 (https://phabricator.wikimedia.org/T370265) (owner: 10Abijeet Patro) [18:40:04] (03PS5) 10Abijeet Patro: tables-catalog: Add translate_message_group_subscriptions table [puppet] - 10https://gerrit.wikimedia.org/r/1082549 (https://phabricator.wikimedia.org/T372287) [18:42:13] (03PS2) 10Herron: site: add insetup roles to aux-k8s-worker100[45] [puppet] - 10https://gerrit.wikimedia.org/r/1088359 (https://phabricator.wikimedia.org/T378989) [18:43:40] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2138 to codfw - jhancock@cumin2002" [18:43:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2138 to codfw - jhancock@cumin2002" [18:43:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:43:59] (03CR) 10Herron: [C:03+2] "self-merging to kick off vm installs" [puppet] - 10https://gerrit.wikimedia.org/r/1088359 (https://phabricator.wikimedia.org/T378989) (owner: 10Herron) [18:45:09] (03PS6) 10Abijeet Patro: tables-catalog: Add translate_message_group_subscriptions table [puppet] - 10https://gerrit.wikimedia.org/r/1082549 (https://phabricator.wikimedia.org/T372287) [18:48:54] (03CR) 10Ladsgroup: [C:03+2] tables-catalog: Add translate_message_group_subscriptions table [puppet] - 10https://gerrit.wikimedia.org/r/1082549 (https://phabricator.wikimedia.org/T372287) (owner: 10Abijeet Patro) [18:49:05] (03CR) 10Ladsgroup: [V:03+2 C:03+2] tables-catalog: Add translate_message_group_subscriptions table [puppet] - 10https://gerrit.wikimedia.org/r/1082549 (https://phabricator.wikimedia.org/T372287) (owner: 
10Abijeet Patro) [18:50:52] !log herron@cumin1002 START - Cookbook sre.ganeti.makevm for new host aux-k8s-worker1004.eqiad.wmnet [18:50:54] !log herron@cumin1002 START - Cookbook sre.dns.netbox [18:54:23] (03CR) 10Ottomata: [V:03+1] "Well let's abandon this then! Thomas you can use the existent connection." [puppet] - 10https://gerrit.wikimedia.org/r/1085449 (https://phabricator.wikimedia.org/T306896) (owner: 10Ottomata) [18:54:27] (03Abandoned) 10Ottomata: Add airflow connection conf for datahub [puppet] - 10https://gerrit.wikimedia.org/r/1085449 (https://phabricator.wikimedia.org/T306896) (owner: 10Ottomata) [18:58:36] !log herron@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-worker1004.eqiad.wmnet - herron@cumin1002" [18:58:41] !log herron@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-worker1004.eqiad.wmnet - herron@cumin1002" [18:58:41] !log herron@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:58:41] !log herron@cumin1002 START - Cookbook sre.dns.wipe-cache aux-k8s-worker1004.eqiad.wmnet on all recursors [18:58:44] !log herron@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-worker1004.eqiad.wmnet on all recursors [18:59:09] !log herron@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-worker1004.eqiad.wmnet - herron@cumin1002" [18:59:14] !log herron@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-worker1004.eqiad.wmnet - herron@cumin1002" [19:00:05] jnuche and dduvall: #bothumor When your hammer is PHP, everything starts looking like a thumb. 
Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241107T1900). [19:00:38] (03CR) 10Dzahn: [C:03+2] vrts: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055495 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [19:00:58] !log herron@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1004.eqiad.wmnet with OS bookworm [19:01:12] 06SRE, 10vm-requests, 07Kubernetes, 13Patch-For-Review: eqiad: (2x) aux-k8s-worker nodes - https://phabricator.wikimedia.org/T378989#10301287 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1002 for host aux-k8s-worker1004.eqiad.wmnet with OS bookworm [19:03:09] !log herron@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aux-k8s-worker1004.eqiad.wmnet with OS bookworm [19:03:09] !log herron@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host aux-k8s-worker1004.eqiad.wmnet [19:03:25] 06SRE, 10vm-requests, 07Kubernetes, 13Patch-For-Review: eqiad: (2x) aux-k8s-worker nodes - https://phabricator.wikimedia.org/T378989#10301303 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1002 for host aux-k8s-worker1004.eqiad.wmnet with OS bookworm executed with errors... 
[19:04:42] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1055495/4477/vrts1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1055495 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [19:05:57] jouncebot: nowandnext [19:05:57] For the next 1 hour(s) and 54 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241107T1900) [19:05:57] In 1 hour(s) and 54 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241107T2100) [19:06:17] jnuche: dduvall: do you need this slot? [19:06:23] !log aokoth@cumin1002 START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet [19:06:57] looks like group2 happened this morning [19:07:21] cdanis: we don't need it. thanks [19:07:27] thanks! [19:08:35] !log VRTS - switching firewall provider from iptables to nftables [19:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:35] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on vrts2002.codfw.wmnet with reason: nftables [19:10:50] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on vrts2002.codfw.wmnet with reason: nftables [19:11:36] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on vrts1003.eqiad.wmnet with reason: nftables [19:11:40] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on vrts1003.eqiad.wmnet with reason: nftables [19:18:18] !log aokoth@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host vrts1003.eqiad.wmnet [19:19:00] PROBLEM - Disk space on thanos-be1003 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdh1 176653 MB (4% inode=92%): /srv/swift-storage/sdc1 182897 MB (4% inode=91%): /srv/swift-storage/sdf1 229434 MB (6% inode=91%): /srv/swift-storage/sdg1 202042 MB (5% 
inode=91%): /srv/swift-storage/sdd1 195064 MB (5% inode=91%): /srv/swift-storage/sde1 189787 MB (4% inode=92%): /srv/swift-storage/sdi1 180681 MB (4% inode=91%): /srv/swift-st [19:19:00] k1 178068 MB (4% inode=92%): /srv/swift-storage/sdj1 189109 MB (4% inode=91%): /srv/swift-storage/sdl1 180467 MB (4% inode=91%): /srv/swift-storage/sdm1 187393 MB (4% inode=91%): /srv/swift-storage/sdn1 152299 MB (3% inode=90%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops [19:19:43] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on vrts1003.eqiad.wmnet with reason: nftables [19:19:58] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on vrts1003.eqiad.wmnet with reason: nftables [19:23:02] 06SRE, 10vm-requests, 07Kubernetes, 13Patch-For-Review: eqiad: (2x) aux-k8s-worker nodes - https://phabricator.wikimedia.org/T378989#10301395 (10herron) ` ganeti1028:~# gnt-instance console aux-k8s-worker1004.eqiad.wmnet @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ @ WARNING: REMOTE HOST... [19:23:05] !log T379199 💙cdanis@mwmaint2002.codfw.wmnet ~ ☕ mwscript sql.php --wiki=testcommonswiki /srv/mediawiki/php-1.44.0-wmf.2/extensions/JsonConfig/sql/mysql/tables-generated.sql [19:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:08] T379199: Install globaljsonlinks* tables on testcommons for Charts deployment - https://phabricator.wikimedia.org/T379199 [19:30:48] (03PS1) 10Dzahn: hieradata: delete vrts2001.yaml, host was decom'ed [puppet] - 10https://gerrit.wikimedia.org/r/1088362 (https://phabricator.wikimedia.org/T373420) [19:33:22] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [19:33:27] (03CR) 10Bvibber: [C:03+1] "Looks correct from my end! 
+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087987 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [19:33:51] !log cmooney@cumin1002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [19:33:58] (03PS14) 10Dzahn: phabricator: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055493 (https://phabricator.wikimedia.org/T370677) [19:37:08] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [19:37:26] (03CR) 10Jforrester: Enable Chart extension on testwiki and testcommonswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087987 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [19:41:16] (03CR) 10Dzahn: [C:03+1] "limiting source sets to CACHES can be separate" [puppet] - 10https://gerrit.wikimedia.org/r/1055493 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [19:42:05] (03CR) 10Dzahn: [C:03+2] hieradata: delete vrts2001.yaml, host was decom'ed [puppet] - 10https://gerrit.wikimedia.org/r/1088362 (https://phabricator.wikimedia.org/T373420) (owner: 10Dzahn) [19:42:23] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dummy record for pfw1-eqiad.wikimedia.org - cmooney@cumin1002" [19:42:51] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add dummy record for pfw1-eqiad.wikimedia.org - cmooney@cumin1002" [19:42:51] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:43:33] (03CR) 10Bvibber: Enable Chart extension on testwiki and testcommonswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087987 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [19:43:41] (03CR) 10Dzahn: [V:03+1 C:03+2] "both production hosts (the new ones, vrts2002 and vrts1003) have been rebooted after puppet ran" [puppet] - 
10https://gerrit.wikimedia.org/r/1055495 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [19:47:35] RECOVERY - BFD status on cr2-magru is OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:47:37] RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:47:39] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:48:11] (03PS7) 10Bvibber: Enable Chart extension on testwiki and testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087987 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [19:49:15] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:49:31] (03CR) 10Bvibber: [C:03+1] "(added comment warning about known parsoid problems with wmgUseChart at present)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087987 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [19:50:14] (03PS8) 10Bvibber: Enable Chart extension on testwiki and testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087987 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [19:53:47] (03PS9) 10Bvibber: Enable Chart extension on testwiki and testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087987 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [19:54:22] (03CR) 10Jforrester: [C:03+1] Enable Chart extension on testwiki and testcommonswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087987 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [19:54:42] jouncebot: nowandnext [19:54:42] For the next 1 hour(s) and 5 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241107T1900) [19:54:42] In 1 hour(s) and 5 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241107T2100) [19:59:04] (03CR) 10CDanis: [C:03+1] Enable Chart extension on testwiki and testcommonswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087987 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [20:00:35] (03CR) 10Bvibber: [C:03+1] Enable Chart extension on testwiki and testcommonswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087987 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [20:02:40] !log dduvall@deploy2002 Installing scap version "4.122.0" for 209 hosts [20:05:06] (03CR) 10Seddon: [C:03+1] DB config for testcommonswiki deployment for Charts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087975 (https://phabricator.wikimedia.org/T379199) (owner: 10Bvibber) [20:07:02] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:09:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cdanis@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087975 (https://phabricator.wikimedia.org/T379199) (owner: 10Bvibber) [20:09:53] cdanis: fyi i just deployed a new version of scap so ping me if something goes awry [20:09:55] (03Merged) 10jenkins-bot: DB config for testcommonswiki deployment for Charts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087975 (https://phabricator.wikimedia.org/T379199) (owner: 10Bvibber) [20:10:14] ack! 
I just ran into something silly actually, but it's not the fault of your new version [20:10:15] !log cdanis@deploy2002 Started scap sync-world: Backport for [[gerrit:1087975|DB config for testcommonswiki deployment for Charts (T379199)]] [20:10:28] (I had my own ~/.ssh/known_hosts with an old host key for gerrit) [20:10:29] T379199: Install globaljsonlinks* tables on testcommons for Charts deployment - https://phabricator.wikimedia.org/T379199 [20:10:45] arguably scap should set a single -oKnownHostsFile when it invokes ssh [20:10:45] woo [20:11:13] cdanis: ah, ok. yeah the biggest change that was deployed is MW flavour (php8.1, etc.) support [20:11:43] so if that fails somehow it'll be during the image build or deployment steps [20:11:47] bvibber: would you like to inspect on the testservers ? [20:12:01] sure [20:12:09] mostly just confirming it doesn't explode :D [20:13:05] !log cdanis@deploy2002 cdanis, bvibber: Backport for [[gerrit:1087975|DB config for testcommonswiki deployment for Charts (T379199)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:13:48] no explode on test [20:13:52] what's the url for testcommons again? 
[20:14:06] https://test-commons.wikimedia.org/wiki/Main_Page [20:15:52] got it, nothing exploding yet [20:15:57] !log cdanis@deploy2002 cdanis, bvibber: Continuing with sync [20:16:24] once aude's patch goes out i should be able to see a live Data: namesapce on test-commons.wikimedia.org iirc [20:21:01] !log cdanis@deploy2002 Finished scap sync-world: Backport for [[gerrit:1087975|DB config for testcommonswiki deployment for Charts (T379199)]] (duration: 10m 45s) [20:21:09] T379199: Install globaljsonlinks* tables on testcommons for Charts deployment - https://phabricator.wikimedia.org/T379199 [20:21:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cdanis@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087987 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [20:21:52] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group, sql_lab role, Kerberos Principal for Khantstop - https://phabricator.wikimedia.org/T379303 (10Khantstop) 03NEW [20:21:57] boo-yah [20:21:58] (03Merged) 10jenkins-bot: Enable Chart extension on testwiki and testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087987 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [20:22:17] !log cdanis@deploy2002 Started scap sync-world: Backport for [[gerrit:1087987|Enable Chart extension on testwiki and testcommonswiki (T378127)]] [20:22:33] T378127: Enable Chart extension on testwiki and testcommons - https://phabricator.wikimedia.org/T378127 [20:24:13] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group, sql_lab role, Kerberos Principal for Khantstop - https://phabricator.wikimedia.org/T379303#10301582 (10OSefu-WMF) Approved as @Khantstop's manager [20:24:42] bvibber: ok should be live on testservers now ptal [20:25:00] !log cdanis@deploy2002 cdanis, aude: Backport for [[gerrit:1087987|Enable Chart extension on testwiki and testcommonswiki (T378127)]] synced to the 
testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:25:24] woohoo [20:25:35] ok confirmed i see jsonconfig's data namespace live on test-commons [20:25:36] (03Abandoned) 10Pppery: Configure new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084932 (https://phabricator.wikimedia.org/T378463) (owner: 10Pppery) [20:25:47] and charts should work on testwiki now right? [20:26:06] (note that a handful of test pages were apparently pre-created and are now hidden by the shadowed namespace, they'll need to be fixed up manually) [20:26:59] Auto-creation of a local account failed: Automatic account creation is not allowed. [20:27:02] on test-commons [20:27:10] so i can't log in to fix pages [20:27:50] so it being a closed wiki is actually a problem? [20:28:05] well if being closed means you can't log in and edit pages to provide data pages then yes [20:28:13] i had been under the impression we were re-opening it? [20:28:18] is that wrong? is it still closed? [20:28:38] I think we might have decided to leave it closed because of something like I think specifically Jdlrobson could edit? [20:28:50] ?? 
[20:29:00] it works if you don't need an account auto-created there, or something [20:29:23] ah hehe [20:29:28] I think I'm going to proceed with the backport for now [20:29:35] we have half an hour left before the actual backport window [20:29:39] well if we're keeping it closed then i think we should rollback this deployment immediately because we can't do anything useful [20:30:17] let's figure that out ASAP but for now I'd like to get production in line with mediawiki-config HEAD again [20:30:19] !log cdanis@deploy2002 cdanis, aude: Continuing with sync [20:30:29] ok [20:35:20] !log cdanis@deploy2002 Finished scap sync-world: Backport for [[gerrit:1087987|Enable Chart extension on testwiki and testcommonswiki (T378127)]] (duration: 13m 02s) [20:35:23] T378127: Enable Chart extension on testwiki and testcommons - https://phabricator.wikimedia.org/T378127 [20:43:21] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [20:45:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.38s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:46:11] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2082.codfw.wmnet with OS bookworm [20:46:23] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10301718 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bookworm [20:47:12] if its marked as closed formally, only stewards should have access to edit iirc [20:49:28] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: adding wdqs2026 to codfw - jhancock@cumin2002" [20:49:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wdqs2026 to codfw - jhancock@cumin2002" [20:49:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:50:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wdqs2026.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:50:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wdqs2027.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [20:50:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.38s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [20:51:54] 06SRE-OnFire, 10Incident Tooling: corto: have CI build the Debian package - https://phabricator.wikimedia.org/T379305 (10Eevans) 03NEW [20:52:05] 06SRE-OnFire, 10Incident Tooling: corto: have CI build the Debian package - https://phabricator.wikimedia.org/T379305#10301799 (10Eevans) p:05Triage→03Medium [20:53:21] 06SRE-OnFire, 10Incident Tooling: implementing an incident response workflow automation tool for SRE - https://phabricator.wikimedia.org/T308467#10301802 (10Eevans) [20:57:50] 06SRE, 06serviceops, 05MediaWiki-backport-deployments, 05Train Deployments: MW script "eval.php" failing during scap operations - https://phabricator.wikimedia.org/T379044#10301824 (10Umherirrender) The last change to eval.php directly was be39a1833251ba15a4447676e5994105e0807259, it is a comment only chan... 
[20:59:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:59:35] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [20:59:47] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241107T2100). [21:00:05] JSherman: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:18] present and ready to self deploy [21:00:26] go ahead! 
:) [21:00:30] it's a config change, so should be quick [21:00:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084883 (https://phabricator.wikimedia.org/T378343) (owner: 10Scardenasmolinar) [21:01:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs2026.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:01:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs2027.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:02:23] (03Merged) 10jenkins-bot: Enable AutoModerator on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1084883 (https://phabricator.wikimedia.org/T378343) (owner: 10Scardenasmolinar) [21:02:41] !log jsn@deploy2002 Started scap sync-world: Backport for [[gerrit:1084883|Enable AutoModerator on viwiki (T378343)]] [21:02:44] T378343: Enable AutoModerator on Vietnamese Wikipedia (viwiki) - https://phabricator.wikimedia.org/T378343 [21:03:05] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage [21:03:05] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2128 to codfw - jhancock@cumin2002" [21:03:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2128 to codfw - jhancock@cumin2002" [21:03:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:04:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:06:23] !log jsn@deploy2002 suecarmol, jsn: Backport for [[gerrit:1084883|Enable AutoModerator on viwiki (T378343)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:06:32] !log jsn@deploy2002 suecarmol, jsn: Continuing with sync [21:06:54] verified that AutoModerator CC page is available on viwiki via debug host [21:08:22] note that one of the testserver assertions initially failed with a 503, but passed on a rerun [21:09:17] !log herron@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1004.eqiad.wmnet with OS bookworm [21:11:10] !log jsn@deploy2002 Finished scap sync-world: Backport for [[gerrit:1084883|Enable AutoModerator on viwiki (T378343)]] (duration: 08m 28s) [21:11:13] T378343: Enable AutoModerator on Vietnamese Wikipedia (viwiki) - https://phabricator.wikimedia.org/T378343 [21:11:35] !log herron@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aux-k8s-worker1004.eqiad.wmnet with OS bookworm [21:11:53] Verified that this is live. All is well and I'm done here. 
[21:16:10] (03PS1) 10Aude: Reopen testcommonswiki for testing Chart extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088366 (https://phabricator.wikimedia.org/T378127) [21:17:37] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2026'] [21:17:38] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2027'] [21:17:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs2026'] [21:17:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs2027'] [21:18:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2026.codfw.wmnet with OS bullseye [21:18:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2027.codfw.wmnet with OS bullseye [21:18:48] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): Q2:rack/setup/install wdqs202[67] - https://phabricator.wikimedia.org/T378031#10301921 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wdqs2026.codfw.wmn... [21:18:49] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): Q2:rack/setup/install wdqs202[67] - https://phabricator.wikimedia.org/T378031#10301922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host wdqs2027.codfw.wmn... 
[21:21:55] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ms-be2082.codfw.wmnet with OS bookworm [21:22:02] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10301955 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bookworm execut... [21:22:27] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [21:24:33] (03PS2) 10Aude: Reopen testcommonswiki for testing Chart extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088366 (https://phabricator.wikimedia.org/T378127) [21:26:26] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2143 to codfw - jhancock@cumin2002" [21:26:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2143 to codfw - jhancock@cumin2002" [21:26:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:27:00] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2082.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [21:30:35] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [21:32:03] (03CR) 10Bking: [C:03+2] wdqs: remove 5 codfw hosts from production [puppet] - 10https://gerrit.wikimedia.org/r/1088185 (https://phabricator.wikimedia.org/T376150) (owner: 10Ryan Kemper) [21:32:12] (03CR) 10Bking: [C:03+1] wdqs: remove 5 codfw hosts from production [puppet] - 10https://gerrit.wikimedia.org/r/1088185 (https://phabricator.wikimedia.org/T376150) (owner: 10Ryan Kemper) [21:33:58] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding 
wikikube-worker2158 to codfw - jhancock@cumin2002" [21:34:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2158 to codfw - jhancock@cumin2002" [21:34:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:41:37] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2082.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [21:44:47] (03CR) 10Cathal Mooney: [C:03+2] Do not configure option 82 insertion for frack switches [homer/public] - 10https://gerrit.wikimedia.org/r/1088337 (https://phabricator.wikimedia.org/T268802) (owner: 10Cathal Mooney) [21:46:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2027.codfw.wmnet with reason: host reimage [21:46:49] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2026.codfw.wmnet with reason: host reimage [21:47:32] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [21:50:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2027.codfw.wmnet with reason: host reimage [21:50:58] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2166 to codfw - jhancock@cumin2002" [21:51:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2166 to codfw - jhancock@cumin2002" [21:51:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:53:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2026.codfw.wmnet with reason: host reimage [21:53:34] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [21:56:10] 
(03Merged) 10jenkins-bot: Do not configure option 82 insertion for frack switches [homer/public] - 10https://gerrit.wikimedia.org/r/1088337 (https://phabricator.wikimedia.org/T268802) (owner: 10Cathal Mooney) [21:58:09] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2170 to codfw - jhancock@cumin2002" [21:58:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding wikikube-worker2170 to codfw - jhancock@cumin2002" [21:58:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:58:47] (03PS1) 10Cathal Mooney: Add puppet entries for new fundraising switches in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1088373 (https://phabricator.wikimedia.org/T377381) [22:06:46] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:07:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:07:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2027.codfw.wmnet with OS bullseye [22:07:33] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): Q2:rack/setup/install wdqs202[67] - https://phabricator.wikimedia.org/T378031#10302199 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wdqs2027.codfw.wmnet w... 
[22:08:48] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2082.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [22:10:11] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): Q2:rack/setup/install wdqs202[67] - https://phabricator.wikimedia.org/T378031#10302204 (10Jhancock.wm) [22:10:18] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[28-35] - https://phabricator.wikimedia.org/T377007#10302205 (10Jhancock.wm) [22:10:46] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:11:19] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10302208 (10Jhancock.wm) [22:12:28] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10302211 (10Jhancock.wm) [22:12:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [22:12:29] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2026.codfw.wmnet with OS bullseye [22:12:34] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): Q2:rack/setup/install wdqs202[67] - https://phabricator.wikimedia.org/T378031#10302212 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host wdqs2026.codfw.wmnet w... 
[22:13:02] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): Q2:rack/setup/install wdqs202[67] - https://phabricator.wikimedia.org/T378031#10302217 (10Jhancock.wm) [22:13:44] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): Q2:rack/setup/install wdqs202[67] - https://phabricator.wikimedia.org/T378031#10302219 (10Jhancock.wm) 05Open→03Resolved @bking all done! [22:14:32] 10ops-codfw, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.10.19 - 2024.11.08): Q2:rack/setup/install elastic211[0-5] - https://phabricator.wikimedia.org/T378034#10302214 (10bking) @RobH [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/1084253 | CR with site.pp changes ]] has been... [22:14:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2128.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:15:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2129.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:16:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2136.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:17:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2137.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:17:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2138.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:19:22] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2139.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:19:57] !log jhathaway@cumin2002 END (PASS) - 
Cookbook sre.hosts.provision (exit_code=0) for host ms-be2082.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [22:20:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2140.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:20:21] (03PS1) 10Aude: Enable Tabular data for test commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088375 (https://phabricator.wikimedia.org/T378127) [22:21:00] (03CR) 10CI reject: [V:04-1] Enable Tabular data for test commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088375 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [22:21:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2141.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:22:14] !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2082.codfw.wmnet with OS bookworm [22:22:15] (03CR) 10Bvibber: [C:03+1] "looks right to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088375 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [22:22:22] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10302242 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bookworm [22:22:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2142.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:23:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2143.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:24:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2144.mgmt.codfw.wmnet 
with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:24:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2128.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:24:59] (03PS2) 10Aude: Enable Tabular data for test commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088375 (https://phabricator.wikimedia.org/T378127) [22:25:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2145.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:25:52] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2156.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:26:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2129.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:27:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2157.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:27:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2136.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:27:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2158.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:27:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2137.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:28:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host 
wikikube-worker2138.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:28:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2159.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:29:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2160.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:30:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2161.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:30:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2139.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:30:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2140.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:30:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2162.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:31:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2163.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:32:55] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2164.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:33:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2165.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:33:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) 
for host wikikube-worker2142.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:34:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2143.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:34:27] !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2082.codfw.wmnet with reason: host reimage [22:34:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2166.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:34:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2144.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:35:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2167.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:35:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2145.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:36:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2168.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:37:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2169.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:37:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2156.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:37:29] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
ms-be2082.codfw.wmnet with reason: host reimage [22:37:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-worker2170.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:37:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2157.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:38:29] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2158.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:39:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2159.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:39:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2141.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:40:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2160.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:41:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2161.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:41:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2162.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:42:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2163.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:43:21] 10ops-eqiad, 06SRE, 
06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10302257 (10VRiley-WMF) This has been added to the unit. Please test when possible. [22:43:44] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2164.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:44:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2165.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:45:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2166.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:46:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2167.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:47:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2168.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:47:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2169.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:48:25] (03CR) 10Jdlrobson: [C:03+1] Enable Tabular data for test commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088375 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [22:48:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2170.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:49:09] (03PS2) 10Ryan Kemper: wdqs: remove 5 codfw hosts from production 
[puppet] - 10https://gerrit.wikimedia.org/r/1088185 (https://phabricator.wikimedia.org/T376150) [22:49:09] (03PS2) 10Ryan Kemper: [WIP] create wdqs-internal-main role [puppet] - 10https://gerrit.wikimedia.org/r/1088210 [22:49:53] (03CR) 10CI reject: [V:04-1] [WIP] create wdqs-internal-main role [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (owner: 10Ryan Kemper) [22:51:51] (03PS3) 10Aude: Enable Tabular data for test commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088375 (https://phabricator.wikimedia.org/T378127) [22:55:26] (03PS3) 10Ryan Kemper: wdqs: create wdqs-internal-[main,scholarly] roles [puppet] - 10https://gerrit.wikimedia.org/r/1088210 (https://phabricator.wikimedia.org/T379329) [22:56:44] (03CR) 10Jdlrobson: [C:03+1] Enable Tabular data for test commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088375 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [23:00:40] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be2082.codfw.wmnet with OS bookworm [23:00:47] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10302320 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host ms-be2082.codfw.wmnet with OS bookworm comple... [23:03:39] (03PS1) 10Ryan Kemper: wdqs: new pybal pools for internal graph split [puppet] - 10https://gerrit.wikimedia.org/r/1088383 (https://phabricator.wikimedia.org/T379330) [23:06:05] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-be208[1-8] - https://phabricator.wikimedia.org/T371400#10302344 (10jhathaway) @elukey I tried reproducing the double Debian installer bug, but I failed, the steps I tried. 1. UEFI reimage, just to confirm exi... 
[23:08:08] 06SRE-OnFire, 10Incident Tooling: Corto: Scrutinize/finalize template text - https://phabricator.wikimedia.org/T376941#10302347 (10Eevans) I took a stab at this, the (rendered) result of which can be seen [[ https://phabricator.wmcloud.org/T128 | here ]] & [[ https://docs.google.com/document/d/1jRPE-qLt7Xy6zj8... [23:10:14] (03CR) 10Jforrester: "We only finally shut this last February (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/893058), despite promises to RelEn" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088366 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [23:11:51] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:12:07] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:13:23] PROBLEM - Host kubernetes1030 is DOWN: PING CRITICAL - Packet loss = 100% [23:13:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T378916#10302371 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Rebalanced pdu [23:14:50] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T377607#10302375 (10Jclark-ctr) Addressed the ports in B7 with Valerie corrected the issues [23:15:13] RECOVERY - Host kubernetes1030 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [23:15:35] (03PS1) 10Bvibber: Update interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088385 (https://phabricator.wikimedia.org/T378127) [23:20:01] (03CR) 10Aude: [C:03+2] Update interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088385 (https://phabricator.wikimedia.org/T378127) (owner: 10Bvibber) [23:20:08] \o/ 
[23:20:15] (03CR) 10Seddon: "Yep. I have promised Amir that this will be undone, and made a commitment to Nat Baca that I expect to be held accountable to that we will" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088366 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [23:20:30] (03CR) 10Tim Starling: [C:03+1] Add title-case mapping to support migration to PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087604 (https://phabricator.wikimedia.org/T372603) (owner: 10Scott French) [23:20:47] (03Merged) 10jenkins-bot: Update interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088385 (https://phabricator.wikimedia.org/T378127) (owner: 10Bvibber) [23:22:43] (03CR) 10Jforrester: "Cool. Thank you! :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088366 (https://phabricator.wikimedia.org/T378127) (owner: 10Aude) [23:34:56] aude: i think we're good to go on that interwiki map update but let's double-check here :D [23:35:12] ok [23:35:41] hey folks any issues with scap'ing that interwiki map update? final tiny bit we need for our extended window [23:36:10] we can wait if neessary but i don't want to leave the repo out of sync [23:36:46] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T377607#10302419 (10VRiley-WMF) @colewhite we took a look at logging-hd1005. Everything seems to be correct on our end. 
Would it be possible to install ipmitool and run 'sudo ipmitool lan print 1' [23:37:20] aude: if you have everything you need to run the scap backport and no objection i think we're ok [23:40:03] we will revert and continue tomorrow [23:40:07] when more folks are around [23:40:24] (03PS1) 10Bvibber: Revert "Update interwiki map" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088387 [23:40:35] Except tomorrow is a friday [23:40:59] https://wikitech.wikimedia.org/wiki/Deployments#Friday,_November_8 [23:41:23] (03CR) 10Aude: [C:03+2] Revert "Update interwiki map" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088387 (owner: 10Bvibber) [23:41:42] ok we're reverting that until we're ready to push it at once :D [23:41:49] all good [23:42:02] (03Merged) 10jenkins-bot: Revert "Update interwiki map" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1088387 (owner: 10Bvibber) [23:43:35] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS5511/IPv6: Connect - Orange, AS5511/IPv4: Connect - Orange https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:47:57] (03CR) 10Reedy: Add title-case mapping to support migration to PHP 8.1 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1087604 (https://phabricator.wikimedia.org/T372603) (owner: 10Scott French) [23:54:14] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T377607#10302489 (10phaultfinder) [23:57:11] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10302490 (10FastLizard4) Looks like this has just happened on Wikimedia-l. Here's a link to the archived message: https://lists.wikimedia.org/hy...