[00:00:42] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:03:36] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1055630 (owner: TrainBranchBot)
[00:04:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:19:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:24:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:37:20] (PS2) XXBlackburnXx: Update nlwiki AbuseFilter config per consensus [mediawiki-config] - https://gerrit.wikimedia.org/r/1055633
[00:54:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:59:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:04:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:06:43] (PS1) Ladsgroup: Stop storing missing-image-alt-text lints [extensions/Linter] (wmf/1.43.0-wmf.14) - https://gerrit.wikimedia.org/r/1055637 (https://phabricator.wikimedia.org/T370304)
[01:06:48] (CR) Ladsgroup: [C:+2] Stop storing missing-image-alt-text lints [extensions/Linter] (wmf/1.43.0-wmf.14) - https://gerrit.wikimedia.org/r/1055637 (https://phabricator.wikimedia.org/T370304) (owner: Ladsgroup)
[01:07:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 10%: Maint over (T369855 T370304)', diff saved to https://phabricator.wikimedia.org/P66872 and previous config saved to /var/cache/conftool/dbconfig/20240722-010745-ladsgroup.json
[01:07:51] T369855: db1179 crashed - hardware issues - https://phabricator.wikimedia.org/T369855
[01:07:52] T370304: Exception caught inside exception handler: Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. - https://phabricator.wikimedia.org/T370304
[01:08:49] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1002 using scap backport" [extensions/Linter] (wmf/1.43.0-wmf.14) - https://gerrit.wikimedia.org/r/1055637 (https://phabricator.wikimedia.org/T370304) (owner: Ladsgroup)
[01:09:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:09:51] (Merged) jenkins-bot: Stop storing missing-image-alt-text lints [extensions/Linter] (wmf/1.43.0-wmf.14) - https://gerrit.wikimedia.org/r/1055637 (https://phabricator.wikimedia.org/T370304) (owner: Ladsgroup)
[01:10:12] !log ladsgroup@deploy1002 Started scap sync-world: Backport for [[gerrit:1055637|Stop storing missing-image-alt-text lints (T370304)]]
[01:13:29] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1055637|Stop storing missing-image-alt-text lints (T370304)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[01:13:34] T370304: Exception caught inside exception handler: Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. - https://phabricator.wikimedia.org/T370304
[01:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:13:51] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync
[01:14:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:19:01] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1055637|Stop storing missing-image-alt-text lints (T370304)]] (duration: 08m 48s)
[01:19:05] T370304: Exception caught inside exception handler: Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. - https://phabricator.wikimedia.org/T370304
[01:19:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:22:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 25%: Maint over (T369855 T370304)', diff saved to https://phabricator.wikimedia.org/P66873 and previous config saved to /var/cache/conftool/dbconfig/20240722-012251-ladsgroup.json
[01:22:57] T369855: db1179 crashed - hardware issues - https://phabricator.wikimedia.org/T369855
[01:37:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 75%: Maint over (T369855 T370304)', diff saved to https://phabricator.wikimedia.org/P66874 and previous config saved to /var/cache/conftool/dbconfig/20240722-013756-ladsgroup.json
[01:38:02] T369855: db1179 crashed - hardware issues - https://phabricator.wikimedia.org/T369855
[01:38:03] T370304: Exception caught inside exception handler: Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. - https://phabricator.wikimedia.org/T370304
[01:49:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:52:35] FIRING: PuppetFailure: Puppet has failed on netboxdb2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:53:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: Maint over (T369855 T370304)', diff saved to https://phabricator.wikimedia.org/P66875 and previous config saved to /var/cache/conftool/dbconfig/20240722-015302-ladsgroup.json
[01:53:08] T369855: db1179 crashed - hardware issues - https://phabricator.wikimedia.org/T369855
[01:53:08] T370304: Exception caught inside exception handler: Wikimedia\Rdbms\DBUnexpectedError: Database servers in extension1 are overloaded. - https://phabricator.wikimedia.org/T370304
[01:54:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:10:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T367856)', diff saved to https://phabricator.wikimedia.org/P66876 and previous config saved to /var/cache/conftool/dbconfig/20240722-021009-marostegui.json
[02:10:14] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[02:22:25] FIRING: SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:24:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:25:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P66877 and previous config saved to /var/cache/conftool/dbconfig/20240722-022516-marostegui.json
[02:29:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:30:37] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:39:20] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:40:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P66878 and previous config saved to /var/cache/conftool/dbconfig/20240722-024023-marostegui.json
[02:44:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:49:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:55:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T367856)', diff saved to https://phabricator.wikimedia.org/P66879 and previous config saved to /var/cache/conftool/dbconfig/20240722-025530-marostegui.json
[02:55:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db2170.codfw.wmnet with reason: Maintenance
[02:55:36] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[02:55:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db2170.codfw.wmnet with reason: Maintenance
[02:55:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T367856)', diff saved to https://phabricator.wikimedia.org/P66880 and previous config saved to /var/cache/conftool/dbconfig/20240722-025552-marostegui.json
[02:59:20] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:37] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:04:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:17:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:19:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:24:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:54:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:59:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:59:45] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[04:14:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:19:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:44:03] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 22 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1054921 (https://phabricator.wikimedia.org/T370322) (owner: Tchanders)
[04:46:15] (PS5) Tchanders: Enable temporary accounts on testwiki and loginwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895)
[04:49:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:53:14] (CR) Kosta Harlan: Enable temporary accounts on testwiki and loginwiki (4 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895) (owner: Tchanders)
[04:54:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:17:24] (PS1) KartikMistry: uzwiki: Limit publishing in CX to 'patroller' and 'sysop' groups [mediawiki-config] - https://gerrit.wikimedia.org/r/1055653 (https://phabricator.wikimedia.org/T370387)
[05:22:12] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 22 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1055653 (https://phabricator.wikimedia.org/T370387) (owner: KartikMistry)
[05:24:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:29:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:44:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:49:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:52:35] FIRING: PuppetFailure: Puppet has failed on netboxdb2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:26] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:19:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:24:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:28:52] (PS15) Stevemunene: wdqs: add main and scholarly puppet config [puppet] - https://gerrit.wikimedia.org/r/1046123 (https://phabricator.wikimedia.org/T364364)
[06:33:55] (CR) Stevemunene: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1046123 (https://phabricator.wikimedia.org/T364364) (owner: Stevemunene)
[06:39:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 3.641% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[06:40:15] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10001348 (ayounsi) There is an outstanding diff on the switch for `cloudcephmon1006`. It looks correct, but could DCops double check it and make sure th...
[06:44:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 3.641% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[06:45:37] FIRING: [32x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:49:20] FIRING: [32x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:50:37] FIRING: [32x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:55:37] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:59:20] FIRING: [33x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:59:20] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:00:05] Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240722T0700).
[07:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:07] * kart_ is here and will deploy..
[07:01:42] (PS1) STran: IPInfoHandler: Move token param definition to getBodyParamSettings [extensions/IPInfo] (wmf/1.43.0-wmf.14) - https://gerrit.wikimedia.org/r/1055771 (https://phabricator.wikimedia.org/T370500)
[07:02:34] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 22 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/IPInfo] (wmf/1.43.0-wmf.14) - https://gerrit.wikimedia.org/r/1055771 (https://phabricator.wikimedia.org/T370500) (owner: STran)
[07:03:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:04:59] ☝️ I meant to add that patch to this deployment window and am also here to deploy it
[07:06:17] Tran: go ahead. I'm cancelling my patch.
[07:10:14] alright thanks. I've got a bit of local troubleshooting to do first so if there's another patch that's going out, feel free. I'll be back in 10.
[07:12:13] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on netbox1003.eqiad.wmnet with reason: netbox upgrade prep work
[07:12:26] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netbox1003.eqiad.wmnet with reason: netbox upgrade prep work
[07:13:14] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576#10001399 (ayounsi) Amazing progress ! > Is it a problem with lpxelinux.0, the NIC firmwares interacting with it (say using HTTP etc..) or both? Good...
[07:17:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:19:51] back and starting my backport
[07:20:01] (CR) TrainBranchBot: [C:+2] "Approved by stran@deploy1002 using scap backport" [extensions/IPInfo] (wmf/1.43.0-wmf.14) - https://gerrit.wikimedia.org/r/1055771 (https://phabricator.wikimedia.org/T370500) (owner: STran)
[07:22:57] (Merged) jenkins-bot: IPInfoHandler: Move token param definition to getBodyParamSettings [extensions/IPInfo] (wmf/1.43.0-wmf.14) - https://gerrit.wikimedia.org/r/1055771 (https://phabricator.wikimedia.org/T370500) (owner: STran)
[07:23:22] !log stran@deploy1002 Started scap sync-world: Backport for [[gerrit:1055771|IPInfoHandler: Move token param definition to getBodyParamSettings (T370500)]]
[07:23:26] T370500: IP information could not be retrieved - https://phabricator.wikimedia.org/T370500
[07:25:44] !log stran@deploy1002 stran: Backport for [[gerrit:1055771|IPInfoHandler: Move token param definition to getBodyParamSettings (T370500)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:27:06] (CR) Ayounsi: "Thanks, one small addition needed." [homer/public] - https://gerrit.wikimedia.org/r/1055543 (https://phabricator.wikimedia.org/T370156) (owner: Southparkfan)
[07:30:35] !log stran@deploy1002 stran: Continuing with sync
[07:30:46] (CR) Ayounsi: [C:+1] "lgtm, I can take care of deploying it." [homer/public] - https://gerrit.wikimedia.org/r/1055544 (https://phabricator.wikimedia.org/T370156) (owner: Southparkfan)
[07:31:49] (CR) Ayounsi: [C:+1] "lgtm, that was from the time ns2 was part of the esams LVS range." [homer/public] - https://gerrit.wikimedia.org/r/1055546 (https://phabricator.wikimedia.org/T370156) (owner: Southparkfan)
[07:32:23] (CR) Ayounsi: border-in: remove authdns filter (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1055546 (https://phabricator.wikimedia.org/T370156) (owner: Southparkfan)
[07:33:15] SRE, Cassandra: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#10001425 (Nobleadele) I'm planning to replace five Samsung SSD 850 devices with four 1.6TB Intel or HP SSDs in my data center. Has anyone gone through a similar upgrade? What are t...
[07:35:11] (CR) Ayounsi: "Adding one more pairs of eyes to the review just in case 😊" [homer/public] - https://gerrit.wikimedia.org/r/1055546 (https://phabricator.wikimedia.org/T370156) (owner: Southparkfan)
[07:35:40] !log stran@deploy1002 Finished scap: Backport for [[gerrit:1055771|IPInfoHandler: Move token param definition to getBodyParamSettings (T370500)]] (duration: 12m 18s)
[07:35:44] T370500: IP information could not be retrieved - https://phabricator.wikimedia.org/T370500
[07:39:15] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on netbox2003.codfw.wmnet with reason: netbox upgrade prep work
[07:39:29] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netbox2003.codfw.wmnet with reason: netbox upgrade prep work
[07:43:22] done :wave
[07:48:36] (PS1) Brouberol: superset: upgrade to v4.0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1055870 (https://phabricator.wikimedia.org/T370152)
[07:49:20] FIRING: [2x] SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:55:43] (PS1) Effie Mouzeli: mediawiki: make mcrouter deployment compatible with mcrouter 1.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1055871
[07:59:28] (PS1) Effie Mouzeli: mediawiki: make mcrouter deployment compatible with mcrouter 1.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1055872
[07:59:45] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[08:00:30] !log brouberol@cumin1002 START - Cookbook sre.hosts.decommission for hosts karapace1002.eqiad.wmnet [08:03:10] !log brouberol@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts karapace1002.eqiad.wmnet [08:03:19] (03PS1) 10Brouberol: analytics_test_cluster: set schema registry URL to datahub-gms-next [puppet] - 10https://gerrit.wikimedia.org/r/1055873 (https://phabricator.wikimedia.org/T363461) [08:03:44] (03CR) 10Brouberol: [C:03+1] dse: Add securityContext to istio components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052701 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [08:05:52] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on kafka-main2001.codfw.wmnet with reason: restart attempt [08:06:06] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on kafka-main2001.codfw.wmnet with reason: restart attempt [08:06:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 2.549% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:06:29] (03CR) 10Brouberol: [C:03+1] Upgrade airflow test instance version to v2.9.2 [puppet] - 10https://gerrit.wikimedia.org/r/1054329 (https://phabricator.wikimedia.org/T365449) (owner: 10Stevemunene) [08:06:44] !log restart kafka on kafka-main2001 - sre.hosts.downtime [08:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:47] uff [08:07:09] !log restart kafka on kafka-main2001 - T370574 [08:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:13] T370574: kafka2001 seems out of sync with the rest of the cluster - https://phabricator.wikimedia.org/T370574 [08:09:00] (03CR) 10Klausman: 
[C:+2] dse: Add securityContext to istio components [deployment-charts] - https://gerrit.wikimedia.org/r/1052701 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm) [08:09:58] (Merged) jenkins-bot: dse: Add securityContext to istio components [deployment-charts] - https://gerrit.wikimedia.org/r/1052701 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm) [08:11:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 3.701% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:11:28] ^ having a look [08:19:20] FIRING: [2x] SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:21:03] claime: fyi we seem to have load issues on the search clusters since 5am on elastic@codfw and 6am on elastic@eqiad [08:21:43] (CR) Stevemunene: [C:+1] "lgtm!"
[deployment-charts] - https://gerrit.wikimedia.org/r/1055870 (https://phabricator.wikimedia.org/T370152) (owner: Brouberol) [08:22:01] (CR) Brouberol: [C:+2] superset: upgrade to v4.0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1055870 (https://phabricator.wikimedia.org/T370152) (owner: Brouberol) [08:23:03] (CR) Btullis: [C:+1] analytics_test_cluster: set schema registry URL to datahub-gms-next [puppet] - https://gerrit.wikimedia.org/r/1055873 (https://phabricator.wikimedia.org/T363461) (owner: Brouberol) [08:23:33] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply [08:24:17] (CR) Jelto: [V:-1] "PCC fails https://puppet-compiler.wmflabs.org/output/1055496/3331/" [puppet] - https://gerrit.wikimedia.org/r/1055496 (owner: Dzahn) [08:24:41] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply [08:26:21] (CR) Jelto: [V:-1] "PCC fails https://puppet-compiler.wmflabs.org/output/1055490/3332/" [puppet] - https://gerrit.wikimedia.org/r/1055490 (owner: Dzahn) [08:28:34] (CR) Elukey: [C:+2] Homer: fix Netbox 4 breaking changes [software/homer] - https://gerrit.wikimedia.org/r/1050377 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi) [08:29:22] (CR) Brouberol: [C:+2] analytics_test_cluster: set schema registry URL to datahub-gms-next [puppet] - https://gerrit.wikimedia.org/r/1055873 (https://phabricator.wikimedia.org/T363461) (owner: Brouberol) [08:30:33] (Merged) jenkins-bot: Homer: fix Netbox 4 breaking changes [software/homer] - https://gerrit.wikimedia.org/r/1050377 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi) [08:30:48] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on kafka-main2005.codfw.wmnet with reason: restart attempt [08:31:01] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on
kafka-main2005.codfw.wmnet with reason: restart attempt [08:31:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 23.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:32:26] !log restart kafka on kafka-main2005 - T370574 [08:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:30] T370574: kafka2001 seems out of sync with the rest of the cluster - https://phabricator.wikimedia.org/T370574 [08:33:46] (CR) Klausman: "Thanks for the review! I will wait with the +2 and automerge until some other rollouts are done. I'll ping you with the remaining patches " [deployment-charts] - https://gerrit.wikimedia.org/r/1054538 (https://phabricator.wikimedia.org/T365479) (owner: Klausman) [08:34:20] RESOLVED: [2x] SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:34:27] (PS1) Volans: Filter out NaN data from Prometheus [software/statograph] - https://gerrit.wikimedia.org/r/1055875 (https://phabricator.wikimedia.org/T370386) [08:35:16] (CR) Volans: "Mypy tests are failing for me locally for something that seems unrelated."
[software/statograph] - https://gerrit.wikimedia.org/r/1055875 (https://phabricator.wikimedia.org/T370386) (owner: Volans) [08:35:38] (PS1) KartikMistry: Update cxserver to 2024-07-22-050142-production [deployment-charts] - https://gerrit.wikimedia.org/r/1055876 (https://phabricator.wikimedia.org/T363968) [08:35:39] (PS2) Effie Mouzeli: mediawiki: make mcrouter deployment compatible with mcrouter 1.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1055871 [08:36:07] (CR) CI reject: [V:-1] Filter out NaN data from Prometheus [software/statograph] - https://gerrit.wikimedia.org/r/1055875 (https://phabricator.wikimedia.org/T370386) (owner: Volans) [08:36:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 23.91% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:37:20] (PS2) Effie Mouzeli: thumbor: make mcrouter deployment compatible with mcrouter 1.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1055872 [08:37:35] (PS3) Effie Mouzeli: thumbor: make mcrouter deployment compatible with mcrouter 1.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1055872 [08:39:20] FIRING: SystemdUnitFailed: postgresql@15-main.service on netboxdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:42:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 16.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All -
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:45:21] (PS1) Elukey: CHANGELOG: add changelogs for release v0.7.0 [software/homer] - https://gerrit.wikimedia.org/r/1055877 [08:47:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 16.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:47:20] (CR) Ayounsi: [C:+1] CHANGELOG: add changelogs for release v0.7.0 [software/homer] - https://gerrit.wikimedia.org/r/1055877 (owner: Elukey) [08:47:42] (CR) Elukey: [C:+2] CHANGELOG: add changelogs for release v0.7.0 [software/homer] - https://gerrit.wikimedia.org/r/1055877 (owner: Elukey) [08:48:06] (Abandoned) Elukey: CHANGELOG: add changelogs for release v0.6.7 [software/homer] - https://gerrit.wikimedia.org/r/1054543 (owner: Ayounsi) [08:50:33] !log ayounsi@cumin1002 START - Cookbook sre.postgresql.postgres-init [08:54:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 4.005% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:55:21] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0) [08:56:01] !log rebalance mediawiki.httpd.accesslog partitions across brokers - T370129 [08:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:05] T370129: topicmappr marshal error on kafka-logging cluster - https://phabricator.wikimedia.org/T370129 [08:56:25] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Broadcom NICs with recent firmware
fail to reimage - https://phabricator.wikimedia.org/T363576#10001578 (cmooney) >>! In T363576#9975780, @Papaul wrote: > @wiki_willy I did more tests on this pxe boot issue we are having with the 10G Dell NIC... [08:59:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 4.005% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:59:20] RESOLVED: SystemdUnitFailed: postgresql@15-main.service on netboxdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:00:23] !log ayounsi@cumin1002 START - Cookbook sre.deploy.python-code netbox to netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: Release v4.0.7 to future netbox prod - ayounsi@cumin1002 - T336275 [09:00:29] T336275: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275 [09:01:19] RESOLVED: PuppetFailure: Puppet has failed on netboxdb2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:03:45] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) netbox to netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: Release v4.0.7 to future netbox prod - ayounsi@cumin1002 - T336275 [09:07:55] (CR) Stevemunene: [C:+1] "lgtm!"
[dns] - https://gerrit.wikimedia.org/r/1051446 (https://phabricator.wikimedia.org/T364364) (owner: Ryan Kemper) [09:09:20] FIRING: [6x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:14:20] FIRING: [28x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:18:38] SRE, Wikimedia-Mailing-lists, Upstream: Unnecessary horizontal scrollbars - https://phabricator.wikimedia.org/T283028#10001607 (Aklapper) Open→Resolved No reply; cannot reproduce; optimistically resolving [09:19:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:19:45] FIRING: [2x] KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [09:21:01] that's me ^ [09:21:32] !log ayounsi@cumin1002 START - Cookbook sre.deploy.python-code netbox to netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: Release v4.0.7 to future netbox prod - ayounsi@cumin1002 - T336275 [09:21:37] T336275: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275 [09:24:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:25:37] FIRING: [6x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:29:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:30:01] (CR) Matthias Mullie: [C:+2] Reduce weight of 'main subject' as it's used inconsistently [mediawiki-config] - https://gerrit.wikimedia.org/r/1055258 (https://phabricator.wikimedia.org/T367774) (owner: Cparle) [09:30:21] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: Release v4.0.7 to future netbox prod - ayounsi@cumin1002 - T336275 [09:30:26] T336275: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275 [09:30:37] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:30:38] (Merged) jenkins-bot: Reduce weight of 'main subject' as it's used inconsistently [mediawiki-config] - https://gerrit.wikimedia.org/r/1055258 (https://phabricator.wikimedia.org/T367774) (owner: Cparle) [09:33:20] Ah FFS. I just (for the 2nd time in a couple of weeks) +2ed a config change that I was only meant to +1. It's now merged and too late to withdraw +2; any objections against me deploying it right away?
[09:34:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:34:20] FIRING: [6x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:34:27] (PS3) Hashar: openstack: remove OpenTofu git::clone file mode [puppet] - https://gerrit.wikimedia.org/r/1054890 (https://phabricator.wikimedia.org/T338277) [09:35:18] (CR) Hashar: "I have rebased the change which conflicted with Ia20cb6339b37c793d5dcf460283dd0f6faccdfcc" [puppet] - https://gerrit.wikimedia.org/r/1054890 (https://phabricator.wikimedia.org/T338277) (owner: Hashar) [09:35:33] (PS1) Ayounsi: Revert "Cumin aliases: hardcode current Netbox prod servers" [puppet] - https://gerrit.wikimedia.org/r/1055881 [09:36:37] (PS1) Giuseppe Lavagetto: [stub] confctl: add support for dry-run in conftool [software/spicerack] - https://gerrit.wikimedia.org/r/1055882 [09:38:23] (PS2) Giuseppe Lavagetto: [stub] confctl: add support for dry-run in conftool [software/spicerack] - https://gerrit.wikimedia.org/r/1055882 [09:39:46] (CR) Ayounsi: [C:+2] Revert "Cumin aliases: hardcode current Netbox prod servers" [puppet] - https://gerrit.wikimedia.org/r/1055881 (owner: Ayounsi) [09:40:33] !log homer 'cr*codfw*' commit 'T351074' [09:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:37] FIRING: [31x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:40:38] RESOLVED: [6x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:40:38] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [09:42:29] !log mlitn@deploy1002 Started scap sync-world: Backport for [[gerrit:1055258|Reduce weight of 'main subject' as it's used inconsistently (T367774)]] [09:42:33] T367774: Deal with huge sets of uploads with "main subject" statements - https://phabricator.wikimedia.org/T367774 [09:44:50] (CR) CI reject: [V:-1] [stub] confctl: add support for dry-run in conftool [software/spicerack] - https://gerrit.wikimedia.org/r/1055882 (owner: Giuseppe Lavagetto) [09:44:54] !log mlitn@deploy1002 cparle, mlitn: Backport for [[gerrit:1055258|Reduce weight of 'main subject' as it's used inconsistently (T367774)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:45:36] !log mlitn@deploy1002 cparle, mlitn: Continuing with sync [09:50:49] !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:1055258|Reduce weight of 'main subject' as it's used inconsistently (T367774)]] (duration: 08m 19s) [09:50:53] T367774: Deal with huge sets of uploads with "main subject" statements - https://phabricator.wikimedia.org/T367774 [09:51:22] Done, apologies for the interruption :) [09:51:58] !log set mediawiki.httpd.accesslog topic retention to 26h temporarily [09:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:01] (CR) Ayounsi: [C:+2] Netbox 4: point prod service to new servers [puppet] - https://gerrit.wikimedia.org/r/1055187 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi) [09:53:10] (PS6) Tchanders: Enable
temporary accounts on testwiki and loginwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895) [09:54:20] FIRING: [29x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:54:45] (CR) Tchanders: Enable temporary accounts on testwiki and loginwiki (4 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895) (owner: Tchanders) [09:54:46] (PS4) Effie Mouzeli: thumbor: make mcrouter deployment compatible with mcrouter 1.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1055872 [09:54:52] (PS1) David Caro: kubeadm: add helm-sudo as pair of kubectl-sudo [puppet] - https://gerrit.wikimedia.org/r/1055885 [09:55:24] (CR) Arturo Borrero Gonzalez: [C:+1] "I don't fully understand the rationale behind this, but the change looks harmless, so giving a +1."
[puppet] - https://gerrit.wikimedia.org/r/1054890 (https://phabricator.wikimedia.org/T338277) (owner: Hashar) [09:56:17] (CR) Arturo Borrero Gonzalez: [C:+2] openstack: remove OpenTofu git::clone file mode [puppet] - https://gerrit.wikimedia.org/r/1054890 (https://phabricator.wikimedia.org/T338277) (owner: Hashar) [09:59:20] FIRING: [28x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240722T1000) [10:00:39] (PS1) Jelto: gitlab: add option to throttle and drop traffic using nftables [puppet] - https://gerrit.wikimedia.org/r/1055886 (https://phabricator.wikimedia.org/T366882) [10:02:21] jouncebot: nowandnext [10:02:21] For the next 0 hour(s) and 57 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240722T1000) [10:02:21] In 2 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240722T1300) [10:02:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (GET pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=GET - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:04:20] FIRING: [29x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:06:11]
FIRING: Temperature: GPU1 Temp issue on ml-staging2003:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=ml-staging2003 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [10:07:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (GET pods) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=GET - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:09:20] FIRING: [29x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:11:11] RESOLVED: Temperature: GPU1 Temp issue on ml-staging2003:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=ml-staging2003 - https://alerts.wikimedia.org/?q=alertname%3DTemperature [10:13:22] (CR) Filippo Giunchedi: [C:+1] "LGTM" [software/statograph] - https://gerrit.wikimedia.org/r/1055875 (https://phabricator.wikimedia.org/T370386) (owner: Volans) [10:15:37] FIRING: [29x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:15:45] (CR) Filippo Giunchedi: [C:+1] "LGTM, I'll let Keith comment too tho" [puppet] - https://gerrit.wikimedia.org/r/1054892 (https://phabricator.wikimedia.org/T338277) (owner: Hashar)
[10:19:20] FIRING: [29x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:24:14] (PS1) Ayounsi: Enable OIDC auth on new netbox [puppet] - https://gerrit.wikimedia.org/r/1055887 (https://phabricator.wikimedia.org/T336275) [10:24:23] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1055887 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi) [10:24:47] !log kafka preferred-replica-election on kafka-main - T370574 [10:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:51] T370574: kafka2001 seems out of sync with the rest of the cluster - https://phabricator.wikimedia.org/T370574 [10:28:42] !log Restarting MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration [10:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:04] (CR) Ayounsi: [C:+2] Enable OIDC auth on new netbox [puppet] - https://gerrit.wikimedia.org/r/1055887 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi) [10:29:09] (CR) Elukey: [C:+1] Enable OIDC auth on new netbox [puppet] - https://gerrit.wikimedia.org/r/1055887 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi) [10:32:39] !log Running `mwscript extensions/MediaModeration/maintenance/updateMetrics.php --wiki=commonswiki --verbose` [10:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:08] !log upgraded manually prometheus-ipmi-exporter to v 1.8.0-1~wmf12+1 on db1179 (leftover because was down) T368088 [10:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:12] T368088: upgrade prometheus-ipmi-exporter to 1.8.0 - https://phabricator.wikimedia.org/T368088
[10:37:21] SRE, ConfirmEdit (CAPTCHA extension), WMF-General-or-Unknown: Remove words with apostrophes from captcha wordlist - https://phabricator.wikimedia.org/T370531#10001866 (Ladsgroup) Open→Resolved a:Ladsgroup [10:39:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:40:37] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:45:37] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:49:13] (PS1) DCausse: GeoData: add pool counter settings [mediawiki-config] - https://gerrit.wikimedia.org/r/1055890 (https://phabricator.wikimedia.org/T370621) [10:49:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:50:37] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:55:37] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on
netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:59:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:05:34] FIRING: [2x] KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster logging-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [11:07:40] jouncebot: nowandnext [11:07:40] No deployments scheduled for the next 1 hour(s) and 52 minute(s) [11:07:40] In 1 hour(s) and 52 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240722T1300) [11:07:45] awesome [11:08:12] (PS6) Ebrahim: Enable ICU provided alphabetical order in the Kurdish wikis categories [mediawiki-config] - https://gerrit.wikimedia.org/r/1054641 (https://phabricator.wikimedia.org/T48235) [11:08:14] (CR) Ladsgroup: [C:+2] Enable ICU provided alphabetical order in the Kurdish wikis categories [mediawiki-config] - https://gerrit.wikimedia.org/r/1054641 (https://phabricator.wikimedia.org/T48235) (owner: Ebrahim) [11:08:35] (CR) Clément Goubert: [C:+1] mediawiki: make mcrouter deployment compatible with mcrouter 1.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1055871 (owner: Effie Mouzeli) [11:08:44] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1054641 (https://phabricator.wikimedia.org/T48235) (owner: Ebrahim) [11:08:52] (Merged) jenkins-bot: Enable ICU provided alphabetical order in the
Kurdish wikis categories [mediawiki-config] - https://gerrit.wikimedia.org/r/1054641 (https://phabricator.wikimedia.org/T48235) (owner: Ebrahim) [11:09:08] !log ladsgroup@deploy1002 Started scap sync-world: Backport for [[gerrit:1054641|Enable ICU provided alphabetical order in the Kurdish wikis categories (T48235)]] [11:09:13] T48235: Kurdish Wikipedia: Alphabetical order in the categories (collation) - https://phabricator.wikimedia.org/T48235 [11:11:39] !log ladsgroup@deploy1002 ebrahim, ladsgroup: Backport for [[gerrit:1054641|Enable ICU provided alphabetical order in the Kurdish wikis categories (T48235)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:12:22] !log ladsgroup@deploy1002 ebrahim, ladsgroup: Continuing with sync [11:14:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:14:27] (PS2) Effie Mouzeli: app.job: update module (vanilla) [deployment-charts] - https://gerrit.wikimedia.org/r/1049571 (https://phabricator.wikimedia.org/T356885) [11:14:38] (PS8) Effie Mouzeli: (WIP) modules/app: update to job 2.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) [11:15:31] (PS1) Ayounsi: Netbox 4: update cas_server_url to fit OIDC endpoint [puppet] - https://gerrit.wikimedia.org/r/1055893 (https://phabricator.wikimedia.org/T336275) [11:15:37] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:15:42] (CR) CI reject: [V:-1] (WIP) modules/app: update to job 2.0.0
[deployment-charts] - https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: Effie Mouzeli) [11:16:17] (PS2) Ayounsi: Netbox 4: update cas_server_url to fit OIDC endpoint [puppet] - https://gerrit.wikimedia.org/r/1055893 (https://phabricator.wikimedia.org/T336275) [11:16:19] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1055893 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi) [11:17:11] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1054641|Enable ICU provided alphabetical order in the Kurdish wikis categories (T48235)]] (duration: 08m 02s) [11:17:17] T48235: Kurdish Wikipedia: Alphabetical order in the categories (collation) - https://phabricator.wikimedia.org/T48235 [11:17:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:19:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:19:26] (PS9) Effie Mouzeli: (WIP) modules/app: update to job 2.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) [11:20:19] (CR) CI reject: [V:-1] (WIP) modules/app: update to job 2.0.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: Effie Mouzeli) [11:21:07] (PS3) Ayounsi: Netbox 4: update cas_server_url to fit OIDC endpoint [puppet] - https://gerrit.wikimedia.org/r/1055893 (https://phabricator.wikimedia.org/T336275) [11:21:16] (CR) Ayounsi: "check
experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1055893 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [11:22:14] (03PS4) 10Ayounsi: Netbox 4: update cas_server_url to fit OIDC endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1055893 (https://phabricator.wikimedia.org/T336275) [11:22:23] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1055893 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [11:23:38] (03CR) 10Elukey: [C:03+1] Netbox 4: update cas_server_url to fit OIDC endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1055893 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [11:23:46] (03PS4) 10Jelto: gitlab: add option to throttle and drop traffic using nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055886 (https://phabricator.wikimedia.org/T366882) [11:24:52] (03CR) 10Ayounsi: [C:03+2] Netbox 4: update cas_server_url to fit OIDC endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1055893 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [11:25:38] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:29:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:34:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:44:20] FIRING: [27x] 
SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:45:35] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055899 [11:45:37] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:46:45] (03PS1) 10Ayounsi: Netbox 4: set correct oidc_service [puppet] - 10https://gerrit.wikimedia.org/r/1055900 (https://phabricator.wikimedia.org/T336275) [11:48:25] (03PS2) 10Ayounsi: Netbox 4: set correct oidc_service [puppet] - 10https://gerrit.wikimedia.org/r/1055900 (https://phabricator.wikimedia.org/T336275) [11:49:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:50:26] (03CR) 10Ayounsi: [C:03+2] Netbox 4: set correct oidc_service [puppet] - 10https://gerrit.wikimedia.org/r/1055900 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [11:53:47] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [11:54:21] 06SRE, 06Infrastructure-Foundations, 10netops: Set Leaf switches in Codfw rows C & D to active and make new vlans live - https://phabricator.wikimedia.org/T370629 (10cmooney) 03NEW p:05Triage→03Medium [11:54:29] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [11:59:20] FIRING: [27x] SystemdUnitFailed: 
check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:59:44] (03PS1) 10Ayounsi: IDM: Set profile_format FLAT to netbox_oidc [puppet] - 10https://gerrit.wikimedia.org/r/1055901 (https://phabricator.wikimedia.org/T336275) [11:59:58] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1055901 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [12:00:37] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:02:02] (03PS2) 10Ayounsi: IDM: Set profile_format FLAT to netbox_oidc [puppet] - 10https://gerrit.wikimedia.org/r/1055901 (https://phabricator.wikimedia.org/T336275) [12:02:14] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1055901 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [12:03:49] (03CR) 10Btullis: "Great! Almost ready to go." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [12:04:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:07:41] (03CR) 10Vgutierrez: [C:03+2] hiera: Extend bwlim experiment to upload@ulsfo|eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1055252 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [12:08:35] (03CR) 10Elukey: [C:03+1] IDM: Set profile_format FLAT to netbox_oidc [puppet] - 10https://gerrit.wikimedia.org/r/1055901 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [12:08:45] (03CR) 10Ayounsi: [C:03+2] IDM: Set profile_format FLAT to netbox_oidc [puppet] - 10https://gerrit.wikimedia.org/r/1055901 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [12:09:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:11:47] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630 (10cmooney) 03NEW p:05Triage→03Medium [12:12:49] (03CR) 10Vgutierrez: [C:03+1] "LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [12:15:37] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:18:38] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1055908 (owner: 10L10n-bot) [12:19:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:21:51] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055912 [12:22:46] !log restore retention.ms=172800000 for mediawiki.httpd.accesslog [12:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:41] (03CR) 10Vgutierrez: "almost there, fix 12-rate-limiting.vtc and it should be good to go, nice job!" [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) (owner: 10CDobbins) [12:33:45] (03CR) 10Kosta Harlan: Enable temporary accounts on testwiki and loginwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895) (owner: 10Tchanders) [12:35:24] (03PS1) 10Filippo Giunchedi: kafka: use instance-based selection for kafka-kit storage metrics [puppet] - 10https://gerrit.wikimedia.org/r/1055922 (https://phabricator.wikimedia.org/T370129) [12:35:55] (03CR) 10Kosta Harlan: Enable temporary accounts on testwiki and loginwiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895) (owner: 10Tchanders) [12:35:58] (03CR) 10Filippo Giunchedi: "I gave fixing the configuration for kafka-logging a go, let me know what you think!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1055922 (https://phabricator.wikimedia.org/T370129) (owner: 10Filippo Giunchedi) [12:37:13] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3339/co" [puppet] - 10https://gerrit.wikimedia.org/r/1055922 (https://phabricator.wikimedia.org/T370129) (owner: 10Filippo Giunchedi) [12:37:17] (03CR) 10Kosta Harlan: Enable temporary accounts on testwiki and loginwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895) (owner: 10Tchanders) [12:38:12] (03CR) 10Kosta Harlan: Enable temporary accounts on testwiki and loginwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895) (owner: 10Tchanders) [12:39:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:42:12] 06SRE, 06Infrastructure-Foundations, 10netops: Add data to automation for new switches in codfw C/D - https://phabricator.wikimedia.org/T369106#10002128 (10cmooney) [12:42:25] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for milimetric - https://phabricator.wikimedia.org/T365074#10002129 (10Milimetric) Sorry, I just signed it, I'm sure I signed it or some form of it at some point before, I've been an employee for like 12 years almost :P [12:44:18] (03PS7) 10Tchanders: Enable temporary accounts on testwiki and loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895) [12:44:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:44:53] (03PS8) 10Tchanders: Enable temporary accounts on testwiki and loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895) [12:45:15] (03CR) 10Tchanders: Enable temporary accounts on testwiki and loginwiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895) (owner: 10Tchanders) [12:45:37] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:49:20] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:50:37] FIRING: [27x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:53:28] (03CR) 10Brouberol: [C:03+1] kafka: use instance-based selection for kafka-kit storage metrics [puppet] - 10https://gerrit.wikimedia.org/r/1055922 (https://phabricator.wikimedia.org/T370129) (owner: 10Filippo Giunchedi) [12:54:33] (03PS1) 10Volans: ganeti-netbox-sync: Netbox 4 fix [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1055924 (https://phabricator.wikimedia.org/T336275) [12:55:36] (03CR) 10Ayounsi: [C:03+1] ganeti-netbox-sync: Netbox 4 fix 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1055924 (https://phabricator.wikimedia.org/T336275) (owner: 10Volans) [12:56:45] (03CR) 10Volans: [C:03+2] ganeti-netbox-sync: Netbox 4 fix [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1055924 (https://phabricator.wikimedia.org/T336275) (owner: 10Volans) [12:57:48] (03Merged) 10jenkins-bot: ganeti-netbox-sync: Netbox 4 fix [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1055924 (https://phabricator.wikimedia.org/T336275) (owner: 10Volans) [12:59:37] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] kafka: use instance-based selection for kafka-kit storage metrics [puppet] - 10https://gerrit.wikimedia.org/r/1055922 (https://phabricator.wikimedia.org/T370129) (owner: 10Filippo Giunchedi) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240722T1300). [13:00:05] Tchanders, HouseOfM, and physikerwelt: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:12] * Lucas_WMDE can’t deploy, in a meeting [13:00:58] o/ [13:00:58] o/ [13:00:59] o/ [13:01:01] I'm here [13:02:00] I'll start I guess! [13:02:41] \o [13:06:04] (03PS1) 10Filippo Giunchedi: titan: bring thanos 5m retention to 54w [puppet] - 10https://gerrit.wikimedia.org/r/1055928 (https://phabricator.wikimedia.org/T351927) [13:06:25] All deployers are busy today? 
[13:07:09] Thalia is working on getting scap backport running [13:07:20] There were some issues with some screensharing in a meeting [13:07:28] ty [13:07:38] (03PS3) 10Tchanders: Set Flow to read only on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054921 (https://phabricator.wikimedia.org/T370322) [13:07:46] !log power cycling rdb1014.eqiad.wmnet [13:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tchanders@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054921 (https://phabricator.wikimedia.org/T370322) (owner: 10Tchanders) [13:08:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tchanders@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895) (owner: 10Tchanders) [13:09:04] (03Merged) 10jenkins-bot: Set Flow to read only on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054921 (https://phabricator.wikimedia.org/T370322) (owner: 10Tchanders) [13:09:40] (03PS9) 10Tchanders: Enable temporary accounts on testwiki and loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895) [13:10:25] (03CR) 10Dreamy Jazz: [C:03+2] Enable temporary accounts on testwiki and loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895) (owner: 10Tchanders) [13:10:43] (03CR) 10TrainBranchBot: "Approved by tchanders@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895) (owner: 10Tchanders) [13:11:01] (03Merged) 10jenkins-bot: Enable temporary accounts on testwiki and loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054625 (https://phabricator.wikimedia.org/T348895) (owner: 10Tchanders) [13:11:18] !log tchanders@deploy1002 Started 
scap sync-world: Backport for [[gerrit:1054921|Set Flow to read only on testwiki (T370322)]], [[gerrit:1054625|Enable temporary accounts on testwiki and loginwiki (T348895)]] [13:11:25] T370322: Set Flow to read only for testwiki - https://phabricator.wikimedia.org/T370322 [13:11:25] T348895: [Epic] Temporary accounts testwiki deployment - https://phabricator.wikimedia.org/T348895 [13:13:45] !log tchanders@deploy1002 tchanders: Backport for [[gerrit:1054921|Set Flow to read only on testwiki (T370322)]], [[gerrit:1054625|Enable temporary accounts on testwiki and loginwiki (T348895)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:14:20] FIRING: [24x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:14:57] XioNoX: known? ^ [13:15:30] yes [13:15:37] FIRING: [24x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:15:53] ok [13:16:40] sukhe: yes, WIP migration of netbox to netbox 4 [13:16:46] thanks [13:19:20] FIRING: [24x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:19:30] FIRING: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [13:20:25] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 
netbox2002.codfw.wmnet with reason: Netbox 3 silencing [13:20:37] FIRING: [24x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:20:39] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on netbox2002.codfw.wmnet with reason: Netbox 3 silencing [13:20:50] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on netbox1002.eqiad.wmnet with reason: Netbox 3 silencing [13:21:03] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on netbox1002.eqiad.wmnet with reason: Netbox 3 silencing [13:22:48] We are trying to debug an issue with one of the patches [13:23:39] Dreamy_Jazz: what's the issue? [13:24:07] 10ops-eqiad, 06DC-Ops, 06serviceops: hw troubleshooting: CPU 2 machine check error detected for rdb1014.eqiad.wmnet - https://phabricator.wikimedia.org/T370633 (10Clement_Goubert) 03NEW p:05Triage→03High [13:24:14] https://test.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&meta=siteinfo&formatversion=2&siprop=autocreatetempuser is the symptom of this problem [13:24:22] i.e. 
"enabled" is being returned as "false" [13:25:16] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on rdb1014.eqiad.wmnet with reason: Hardware issue [13:25:30] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on rdb1014.eqiad.wmnet with reason: Hardware issue [13:25:37] 10ops-eqiad, 06DC-Ops, 06serviceops: hw troubleshooting: CPU 2 machine check error detected for rdb1014.eqiad.wmnet - https://phabricator.wikimedia.org/T370633#10002243 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9bfde8c2-0f71-4de8-908c-ff3a74fdbe71) set by cgoubert@cumin1002 for 7 da... [13:25:37] FIRING: [23x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:27:04] We've found the cause of the issue. Should be able to write a fix shortly and continue on. 
[13:27:04] (03PS1) 10Ayounsi: Netbox 4: ACLs + breaking changes [puppet] - 10https://gerrit.wikimedia.org/r/1055932 (https://phabricator.wikimedia.org/T336275) [13:27:05] Dreamy_Jazz: hmm, yeah on cs beta temp accounts are disabled as well https://cs.wikipedia.beta.wmflabs.org/wiki/Test [13:27:27] (03CR) 10CI reject: [V:04-1] Netbox 4: ACLs + breaking changes [puppet] - 10https://gerrit.wikimedia.org/r/1055932 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [13:28:50] (03PS2) 10Ayounsi: Netbox 4: ACLs + breaking changes [puppet] - 10https://gerrit.wikimedia.org/r/1055932 (https://phabricator.wikimedia.org/T336275) [13:29:20] FIRING: [23x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:29:24] (03PS1) 10Tchanders: Enable temporary accounts on testwiki and loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055933 (https://phabricator.wikimedia.org/T348895) [13:29:39] kostajh: I think we will fix it shortly with the config fix [13:29:52] !log tchanders@deploy1002 Sync cancelled. [13:30:10] Sorry, technical issues, sorting... 
[13:30:25] (03PS2) 10Tchanders: Enable temporary accounts on testwiki and loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055933 (https://phabricator.wikimedia.org/T348895) [13:30:37] FIRING: [23x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:32:14] (03CR) 10Papaul: [C:03+2] Add franio200[2-3] to DNS file [dns] - 10https://gerrit.wikimedia.org/r/1055481 (owner: 10Papaul) [13:33:20] @Tchanders I can also wait until after 14h UTC [13:33:51] (03CR) 10Elukey: [C:03+1] Netbox 4: ACLs + breaking changes [puppet] - 10https://gerrit.wikimedia.org/r/1055932 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [13:34:03] (03CR) 10Ayounsi: [C:03+2] Netbox 4: ACLs + breaking changes [puppet] - 10https://gerrit.wikimedia.org/r/1055932 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [13:35:20] (03Abandoned) 10Dreamy Jazz: Enable temporary accounts on testwiki and loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055933 (https://phabricator.wikimedia.org/T348895) (owner: 10Tchanders) [13:35:29] (03CR) 10Filippo Giunchedi: [C:03+2] titan: bring thanos 5m retention to 54w [puppet] - 10https://gerrit.wikimedia.org/r/1055928 (https://phabricator.wikimedia.org/T351927) (owner: 10Filippo Giunchedi) [13:35:39] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install franio200[1-3] - https://phabricator.wikimedia.org/T367819#10002299 (10Papaul) [13:35:50] your patch is labs-only, so it can just be merged without separate deployment [13:36:08] fetching onto the deployment host is enough [13:36:27] 06SRE, 06serviceops, 13Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#10002302 (10Clement_Goubert) 
[13:36:49] zabe: I think I cannot do that myself [13:37:33] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, 13Patch-For-Review: Q1:rack/setup/install franio200[1-3] - https://phabricator.wikimedia.org/T367819#10002304 (10Papaul) a:05Papaul→03Dwisehaupt @Dwisehaupt all yours [13:38:03] that's correct, but I mainly wanted to point out that it is fine to just merge your patch outside the deployment window (but I think I also misread your message a bit) [13:38:12] (03PS1) 10Tchanders: Fix logic for handling enabling temporary accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055937 (https://phabricator.wikimedia.org/T348895) [13:38:42] (03CR) 10Dreamy Jazz: [C:03+2] Fix logic for handling enabling temporary accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055937 (https://phabricator.wikimedia.org/T348895) (owner: 10Tchanders) [13:38:57] (03CR) 10STran: [C:03+1] Fix logic for handling enabling temporary accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055937 (https://phabricator.wikimedia.org/T348895) (owner: 10Tchanders) [13:39:20] (03Merged) 10jenkins-bot: Fix logic for handling enabling temporary accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055937 (https://phabricator.wikimedia.org/T348895) (owner: 10Tchanders) [13:39:20] FIRING: [23x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:39:30] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Update codfw LVS connectivity to support new LSW in rows C & D - https://phabricator.wikimedia.org/T370635 (10cmooney) 03NEW p:05Triage→03Medium [13:39:57] !log tchanders@deploy1002 Started scap sync-world: Backport for [[gerrit:1054921|Set Flow to read only on testwiki (T370322)]], [[gerrit:1054625|Enable temporary 
accounts on testwiki and loginwiki (T348895)]], [[gerrit:1055937|Fix logic for handling enabling temporary accounts (T348895)]] [13:40:02] T370322: Set Flow to read only for testwiki - https://phabricator.wikimedia.org/T370322 [13:40:03] T348895: [Epic] Temporary accounts testwiki deployment - https://phabricator.wikimedia.org/T348895 [13:40:07] 06SRE: Degraded RAID on mw2432 - https://phabricator.wikimedia.org/T370258#10002333 (10Papaul) [13:40:24] 06SRE: Degraded RAID on mw2432 - https://phabricator.wikimedia.org/T370258#10002330 (10Papaul) a:03Clement_Goubert @Clement_Goubert hello we are going to assign this task to you and remove the dc-ops tag on it since you are doing some testing on this server. You can resolve it when done. Thank you [13:40:37] FIRING: [23x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:40:50] zabe: My understanding is that the deployment will happen in the order from the wiki page https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240722T1300 and since the time until 14:00 UTC is running out, I wanted to state that I'm available longer. [13:41:11] 06SRE, 06serviceops, 13Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#10002337 (10Clement_Goubert) [13:41:53] alright:) [13:42:16] 06SRE: Degraded RAID on mw2432 - https://phabricator.wikimedia.org/T370258#10002344 (10Clement_Goubert) 05Open→03Invalid @Papaul Yep, sorry, it can be closed now. Cheers. 
[13:42:23] !log tchanders@deploy1002 tchanders: Backport for [[gerrit:1054921|Set Flow to read only on testwiki (T370322)]], [[gerrit:1054625|Enable temporary accounts on testwiki and loginwiki (T348895)]], [[gerrit:1055937|Fix logic for handling enabling temporary accounts (T348895)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:44:20] FIRING: [23x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:44:32] (03PS1) 10Ayounsi: Netbox 4: fix script path vs. extra path [puppet] - 10https://gerrit.wikimedia.org/r/1055940 (https://phabricator.wikimedia.org/T336275) [13:45:11] !log tchanders@deploy1002 tchanders: Continuing with sync [13:45:20] (03PS2) 10Ayounsi: Netbox 4: fix script path vs. extra path [puppet] - 10https://gerrit.wikimedia.org/r/1055940 (https://phabricator.wikimedia.org/T336275) [13:45:37] FIRING: [23x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:46:53] (03PS1) 10Sohom Datta: Do not unreview pages when they are moved [extensions/PageTriage] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1055941 (https://phabricator.wikimedia.org/T370593) [13:49:07] (03CR) 10Cathal Mooney: "Seems safe to me, the "lvs" rules it specify have since been modified to only block towards the LVS service IPs, authdns traffic will not " [homer/public] - 10https://gerrit.wikimedia.org/r/1055546 (https://phabricator.wikimedia.org/T370156) (owner: 10Southparkfan) [13:49:20] FIRING: [23x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:49:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 22 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/PageTriage] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1055941 (https://phabricator.wikimedia.org/T370593) (owner: 10Sohom Datta) [13:49:58] (03PS2) 10Daimona Eaytoy: [arwiki] Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055206 (https://phabricator.wikimedia.org/T370066) [13:50:05] (03PS1) 10Brouberol: growthbook: add svc.eqiad and discovery records [dns] - 10https://gerrit.wikimedia.org/r/1055943 (https://phabricator.wikimedia.org/T365839) [13:50:37] FIRING: [2x] ProbeDown: Service titan2001:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan2001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:50:41] FIRING: [2x] ProbeDown: Service wikikube-ctrl2002:6443 has failed probes (http_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wikikube-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:50:57] !incidents [13:50:57] 4903 (UNACKED) [2x] ProbeDown sre (wikikube-ctrl2002:6443 probes/custom codfw) [13:50:57] 4902 (RESOLVED) ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad) [13:51:04] !ack 4903 [13:51:05] 4903 (ACKED) [2x] ProbeDown sre (wikikube-ctrl2002:6443 probes/custom codfw) [13:51:24] topranks: doing something on asw-c-codfw? 
[13:51:34] (03CR) 10Herron: [C:03+1] "Thanks, LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1054892 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [13:51:38] claime: no... what's up? [13:51:42] I bumped something! [13:51:44] My nad [13:51:46] bad [13:51:49] ah ok [13:51:50] (03CR) 10Jsn.sherman: [C:03+1] "LGTM!" [extensions/PageTriage] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1055941 (https://phabricator.wikimedia.org/T370593) (owner: 10Sohom Datta) [13:52:00] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." [dns] - 10https://gerrit.wikimedia.org/r/1055943 (https://phabricator.wikimedia.org/T365839) (owner: 10Brouberol) [13:52:12] JennH: Cool -- should recover soon? [13:52:15] JennH: ack :D [13:52:21] yeah should just be a moment [13:52:25] (03CR) 10Elukey: [C:03+1] Netbox 4: fix script path vs. extra path [puppet] - 10https://gerrit.wikimedia.org/r/1055940 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [13:52:41] (03CR) 10Ayounsi: [C:03+2] Netbox 4: fix script path vs. extra path [puppet] - 10https://gerrit.wikimedia.org/r/1055940 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [13:52:48] (03CR) 10Brouberol: [C:03+2] growthbook: add svc.eqiad and discovery records [dns] - 10https://gerrit.wikimedia.org/r/1055943 (https://phabricator.wikimedia.org/T365839) (owner: 10Brouberol) [13:53:15] FIRING: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:53:27] all better but still internally screaming [13:54:01] BGP back up to that one host for past 60+ seconds [13:54:05] cmooney@re0.cr1-codfw> show bgp summary | match "10.192.32.76|2620:0:860:103:10:192:32:76" [13:54:05] 10.192.32.76 64602 11 239 0 1 1:06 Establ [13:54:05] 2620:0:860:103:10:192:32:76 64602 17 239 0 1 1:08 Establ [13:54:23] JennH: The worst kind of screaming. 
Thanks for letting us know -- no stress though, these things happen (: [13:54:27] JennH: hey at least you didn't break the _entire_ datacentre like I did Friday :( [13:54:42] oh no! [13:54:50] FIRING: KubernetesCalicoDown: wikikube-ctrl2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-ctrl2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:55:12] FIRING: [23x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:55:17] RESOLVED: [2x] ProbeDown: Service titan2001:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan2001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:55:21] I'm getting a "backport failed" message, not sure what to do next [13:55:32] no matter how much you try, you can't beat crowdstrike, so you're safe ;) [13:55:48] RESOLVED: [2x] ProbeDown: Service wikikube-ctrl2002:6443 has failed probes (http_codfw_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wikikube-ctrl2002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:56:08] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576#10002431 (10elukey) >>! In T363576#10001399, @ayounsi wrote: > Amazing progress ! > >> Is it a problem with lpxelinux.0, the NIC firmwares interacting... 
[13:56:40] Tchanders: Do you want to discuss in the meeting for this? [13:56:44] Flurry of [{reqId}] {exception_url} LogicException: Process cache for 'en' should be set by now. errors [13:56:51] back down now [13:57:01] Dreamy_Jazz: yes [13:57:35] JennH: *hugops* [13:57:35] (03PS1) 10Ayounsi: Netbox config: fix one oversight [puppet] - 10https://gerrit.wikimedia.org/r/1055947 [13:57:40] FIRING: [41x] KubernetesRsyslogDown: rsyslog on kubernetes2007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:58:15] RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:58:18] (03CR) 10Ayounsi: [C:03+2] Netbox config: fix one oversight [puppet] - 10https://gerrit.wikimedia.org/r/1055947 (owner: 10Ayounsi) [13:59:42] RESOLVED: KubernetesCalicoDown: wikikube-ctrl2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-ctrl2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:59:53] (03CR) 10Abijeet Patro: [V:03+2] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1055908 (owner: 10L10n-bot) [13:59:59] FIRING: [22x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:00:16] hmm I'm gonna restart rsyslog on these codfw k8s nodes [14:00:16] (03PS1) 10Zabe: mailmap: Add mapping for Zabe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055948 [14:00:37] FIRING: [22x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:00:39] claime:: Can we re-try scap backport for that change? [14:00:44] Dreamy_Jazz: yeah [14:01:06] should work now, I bet it just failed because a bunch of nodes were unreachable for a bit [14:01:10] !log tchanders@deploy1002 Started scap sync-world: Backport for [[gerrit:1054921|Set Flow to read only on testwiki (T370322)]], [[gerrit:1054625|Enable temporary accounts on testwiki and loginwiki (T348895)]], [[gerrit:1055937|Fix logic for handling enabling temporary accounts (T348895)]] [14:01:22] T370322: Set Flow to read only for testwiki - https://phabricator.wikimedia.org/T370322 [14:01:22] T348895: [Epic] Temporary accounts testwiki deployment - https://phabricator.wikimedia.org/T348895 [14:01:51] (03CR) 10Andrea Denisse: [C:03+1] grafana: clone grafana-grizzly with default parameters [puppet] - 10https://gerrit.wikimedia.org/r/1054892 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [14:02:40] FIRING: [63x] KubernetesRsyslogDown: rsyslog on kubernetes2007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:03:32] !log tchanders@deploy1002 tchanders: Backport for [[gerrit:1054921|Set Flow to read only on testwiki (T370322)]], [[gerrit:1054625|Enable temporary accounts on testwiki and loginwiki (T348895)]], [[gerrit:1055937|Fix logic for handling enabling temporary accounts (T348895)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:03:38] !log tchanders@deploy1002 tchanders: Continuing with sync [14:03:56] 10ops-codfw, 06SRE, 06DC-Ops: PSU down on asw-c7-codfw - https://phabricator.wikimedia.org/T370575#10002446 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm I booped it the wrong way, but both PSUs are up now. [14:07:40] RESOLVED: [63x] KubernetesRsyslogDown: rsyslog on kubernetes2007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:08:22] !log tchanders@deploy1002 Finished scap: Backport for [[gerrit:1054921|Set Flow to read only on testwiki (T370322)]], [[gerrit:1054625|Enable temporary accounts on testwiki and loginwiki (T348895)]], [[gerrit:1055937|Fix logic for handling enabling temporary accounts (T348895)]] (duration: 07m 11s) [14:08:27] T370322: Set Flow to read only for testwiki - https://phabricator.wikimedia.org/T370322 [14:08:27] T348895: [Epic] Temporary accounts testwiki deployment - https://phabricator.wikimedia.org/T348895 [14:09:20] FIRING: [17x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:09:31] Is anyone else around to continue the window? [14:10:13] I am still here, but I can't deploy. 
[14:10:57] sure [14:11:15] (03PS2) 10Physikerwelt: Enable MathJax rendering in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055395 (https://phabricator.wikimedia.org/T370507) [14:11:26] (03CR) 10Zabe: [C:03+2] Enable MathJax rendering in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055395 (https://phabricator.wikimedia.org/T370507) (owner: 10Physikerwelt) [14:11:52] ok, HouseOfM left [14:12:15] (03Merged) 10jenkins-bot: Enable MathJax rendering in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055395 (https://phabricator.wikimedia.org/T370507) (owner: 10Physikerwelt) [14:12:27] (03CR) 10Andrew Bogott: [C:03+2] Changeprop beta: replace mediawiki11 with new instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055528 (https://phabricator.wikimedia.org/T361387) (owner: 10Southparkfan) [14:12:49] physikerwelt: your patch should reach labs within 15 min [14:12:50] (03CR) 10Andrew Bogott: [C:03+2] changeprop beta: fix purge_host [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055529 (owner: 10Southparkfan) [14:12:56] (03CR) 10Andrew Bogott: [C:03+2] deployment-prep: remove Buster appservers [puppet] - 10https://gerrit.wikimedia.org/r/1055530 (https://phabricator.wikimedia.org/T361387) (owner: 10Southparkfan) [14:13:20] perfect thank you [14:13:25] (03CR) 10Zabe: [C:03+2] mailmap: Add mapping for Zabe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055948 (owner: 10Zabe) [14:13:31] (03CR) 10Zabe: [C:03+2] Revert^2 "Set some site names for new-ish wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055614 (https://phabricator.wikimedia.org/T363270) (owner: 10Zabe) [14:13:36] (03PS2) 10Zabe: Revert^2 "Set some site names for new-ish wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055614 (https://phabricator.wikimedia.org/T363270) [14:13:37] (03Merged) 10jenkins-bot: Changeprop beta: replace mediawiki11 with new instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055528 
(https://phabricator.wikimedia.org/T361387) (owner: 10Southparkfan) [14:13:46] (03CR) 10Zabe: [C:03+2] Revert^2 "Set some site names for new-ish wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055614 (https://phabricator.wikimedia.org/T363270) (owner: 10Zabe) [14:14:02] (03Merged) 10jenkins-bot: changeprop beta: fix purge_host [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055529 (owner: 10Southparkfan) [14:14:20] FIRING: [16x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:14:22] (03Merged) 10jenkins-bot: mailmap: Add mapping for Zabe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055948 (owner: 10Zabe) [14:14:29] (03Merged) 10jenkins-bot: Revert^2 "Set some site names for new-ish wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055614 (https://phabricator.wikimedia.org/T363270) (owner: 10Zabe) [14:14:44] (03CR) 10Andrew Bogott: [C:03+2] LabsServices: point wgEchoPushServiceBaseUrl to service record [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055531 (https://phabricator.wikimedia.org/T370459) (owner: 10Southparkfan) [14:15:05] !log zabe@deploy1002 Started scap sync-world: Backport for [[gerrit:1055614|Revert^2 "Set some site names for new-ish wikis" (T363270 T360303 T360310 T363263)]] [14:15:15] T363270: Post-creation work for mywikisource - https://phabricator.wikimedia.org/T363270 [14:15:15] T360303: Post-creation work for kuswiki - https://phabricator.wikimedia.org/T360303 [14:15:15] T360310: Post-creation work for bewwiki - https://phabricator.wikimedia.org/T360310 [14:15:16] T363263: Post-creation work for iglwiki - https://phabricator.wikimedia.org/T363263 [14:15:21] (03CR) 10Andrew Bogott: [C:03+2] deployment-prep: add Bullseye parsoid servers [puppet] - 
10https://gerrit.wikimedia.org/r/1055532 (https://phabricator.wikimedia.org/T361386) (owner: 10Southparkfan) [14:15:37] FIRING: [16x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:15:42] (03CR) 10Andrew Bogott: [C:03+2] LabsServices: replace parsoid endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055533 (https://phabricator.wikimedia.org/T361386) (owner: 10Southparkfan) [14:16:13] (03CR) 10Andrew Bogott: [C:03+2] deployment-prep: add mwmaint03, Bullseye instance [puppet] - 10https://gerrit.wikimedia.org/r/1055534 (https://phabricator.wikimedia.org/T370582) (owner: 10Southparkfan) [14:16:21] (03Merged) 10jenkins-bot: LabsServices: replace parsoid endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055533 (https://phabricator.wikimedia.org/T361386) (owner: 10Southparkfan) [14:16:32] (03CR) 10Andrew Bogott: [C:03+2] LabsServices: change url-downloader service endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055539 (https://phabricator.wikimedia.org/T370466) (owner: 10Southparkfan) [14:16:39] (03CR) 10CI reject: [V:04-1] LabsServices: change url-downloader service endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055539 (https://phabricator.wikimedia.org/T370466) (owner: 10Southparkfan) [14:16:42] (03CR) 10Andrew Bogott: [C:03+2] deployment-prep: add svc for zotero http_proxy [puppet] - 10https://gerrit.wikimedia.org/r/1055540 (https://phabricator.wikimedia.org/T370466) (owner: 10Southparkfan) [14:17:33] !log zabe@deploy1002 zabe: Backport for [[gerrit:1055614|Revert^2 "Set some site names for new-ish wikis" (T363270 T360303 T360310 T363263)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:19:20] FIRING: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on 
netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:19:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 23 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055206 (https://phabricator.wikimedia.org/T370066) (owner: 10Daimona Eaytoy) [14:19:52] (03PS3) 10Southparkfan: LabsServices: point wgEchoPushServiceBaseUrl to service record [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055531 (https://phabricator.wikimedia.org/T370459) [14:21:01] !log zabe@deploy1002 zabe: Continuing with sync [14:21:40] (03PS2) 10Southparkfan: LabsServices: change url-downloader service endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055539 (https://phabricator.wikimedia.org/T370466) [14:22:09] (03CR) 10CI reject: [V:04-1] LabsServices: change url-downloader service endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055539 (https://phabricator.wikimedia.org/T370466) (owner: 10Southparkfan) [14:26:00] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1055614|Revert^2 "Set some site names for new-ish wikis" (T363270 T360303 T360310 T363263)]] (duration: 10m 54s) [14:26:07] T363270: Post-creation work for mywikisource - https://phabricator.wikimedia.org/T363270 [14:26:08] T360303: Post-creation work for kuswiki - https://phabricator.wikimedia.org/T360303 [14:26:08] T360310: Post-creation work for bewwiki - https://phabricator.wikimedia.org/T360310 [14:26:09] T363263: Post-creation work for iglwiki - https://phabricator.wikimedia.org/T363263 [14:26:59] (03CR) 10Andrew Bogott: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055531 (https://phabricator.wikimedia.org/T370459) (owner: 10Southparkfan) [14:27:38] (03Merged) 10jenkins-bot: LabsServices: 
point wgEchoPushServiceBaseUrl to service record [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055531 (https://phabricator.wikimedia.org/T370459) (owner: 10Southparkfan) [14:29:02] (03PS3) 10Southparkfan: LabsServices: change url-downloader service endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055539 (https://phabricator.wikimedia.org/T370466) [14:29:26] (03PS4) 10Southparkfan: LabsServices: change url-downloader service endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055539 (https://phabricator.wikimedia.org/T370466) [14:31:59] (03CR) 10Andrew Bogott: [C:03+2] LabsServices: change url-downloader service endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055539 (https://phabricator.wikimedia.org/T370466) (owner: 10Southparkfan) [14:32:38] (03Merged) 10jenkins-bot: LabsServices: change url-downloader service endpoint [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055539 (https://phabricator.wikimedia.org/T370466) (owner: 10Southparkfan) [14:35:32] zabe thank you everything worked out smoothly. [14:39:20] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:22] (03PS3) 10Brouberol: Create a new chart for growbook using scaffolding. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055417 (https://phabricator.wikimedia.org/T365839) (owner: 10Btullis) [14:43:18] (03PS1) 10Dreamy Jazz: [WIP] Disable temporary accounts on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055953 [14:43:57] (03CR) 10Dreamy Jazz: [C:04-2] "Only merge this if we need to revert the deployment. 
Linked to by https://phabricator.wikimedia.org/T366960#10002664" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055953 (owner: 10Dreamy Jazz) [14:44:53] (03CR) 10Btullis: Create a new chart for growbook using scaffolding. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055417 (https://phabricator.wikimedia.org/T365839) (owner: 10Btullis) [14:45:57] (03PS4) 10Brouberol: Create a new chart for growbook using scaffolding. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055417 (https://phabricator.wikimedia.org/T365839) (owner: 10Btullis) [14:46:50] (03CR) 10CI reject: [V:04-1] Create a new chart for growbook using scaffolding. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055417 (https://phabricator.wikimedia.org/T365839) (owner: 10Btullis) [14:49:16] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#10002716 (10joanna_borun) p:05Triage→03High a:03cmooney [14:50:31] 06SRE, 10Continuous-Integration-Infrastructure, 06Infrastructure-Foundations, 06Release-Engineering-Team: package_builder python-all conflicts with base::standard_packages python2.7 removal - https://phabricator.wikimedia.org/T370337#10002724 (10jhathaway) Y'all need any help on this, or did @dzahn's patch... [14:50:38] (03PS2) 10Ebernhardson: beta: Enable NetworkSession extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055484 (https://phabricator.wikimedia.org/T355267) [14:52:26] (03PS6) 10Brouberol: Create a new chart for growbook using scaffolding. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055417 (https://phabricator.wikimedia.org/T365839) (owner: 10Btullis) [14:53:13] (03CR) 10Brouberol: Create a new chart for growbook using scaffolding. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055417 (https://phabricator.wikimedia.org/T365839) (owner: 10Btullis) [14:53:27] (03PS7) 10Brouberol: Create a new chart for growbook using scaffolding. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055417 (https://phabricator.wikimedia.org/T365839) (owner: 10Btullis) [14:53:34] (03CR) 10Ssingh: [C:03+1] "Looks good, thanks! +1 on Arzhel's suggestion to remove it from definitions/static in this commit as well since ns_group isn't being used " [homer/public] - 10https://gerrit.wikimedia.org/r/1055546 (https://phabricator.wikimedia.org/T370156) (owner: 10Southparkfan) [14:55:24] 06SRE, 10MW-on-K8s, 06serviceops: Update wikitech documentation - https://phabricator.wikimedia.org/T370646 (10Clement_Goubert) 03NEW [14:56:37] jouncebot nowandnext [14:56:37] No deployments scheduled for the next 0 hour(s) and 33 minute(s) [14:56:37] In 0 hour(s) and 33 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240722T1530) [14:57:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053752 (https://phabricator.wikimedia.org/T369115) (owner: 10Ahmon Dancy) [14:58:20] (03CR) 10Ssingh: "(Resetting the vote till definitions/static update)" [homer/public] - 10https://gerrit.wikimedia.org/r/1055546 (https://phabricator.wikimedia.org/T370156) (owner: 10Southparkfan) [14:58:49] (03Merged) 10jenkins-bot: MWMultiVersion.php: Allow MW_FORCE_VERSION to pin the mw version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053752 (https://phabricator.wikimedia.org/T369115) (owner: 10Ahmon Dancy) [14:59:30] !log dancy@deploy1002 Started scap sync-world: Backport for [[gerrit:1053752|MWMultiVersion.php: Allow MW_FORCE_VERSION to pin the mw version (T369115)]] [14:59:35] T369115: [WE6.2.1] Publish pre-train single version containers - https://phabricator.wikimedia.org/T369115 
[15:00:38] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:01:50] !log dancy@deploy1002 dancy: Backport for [[gerrit:1053752|MWMultiVersion.php: Allow MW_FORCE_VERSION to pin the mw version (T369115)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:02:17] 06SRE, 10MW-on-K8s, 06serviceops: Update Parsoid wikitech documentation following mw-on-k8s migration - https://phabricator.wikimedia.org/T370646#10002774 (10Aklapper) [15:03:15] (03PS8) 10Brouberol: Create a new chart for growbook using scaffolding. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055417 (https://phabricator.wikimedia.org/T365839) (owner: 10Btullis) [15:03:41] !log dancy@deploy1002 dancy: Continuing with sync [15:04:39] (03PS1) 10Southparkfan: deployment-prep: remove Buster parsoid and mwmaint hosts [puppet] - 10https://gerrit.wikimedia.org/r/1055956 (https://phabricator.wikimedia.org/T361386) [15:05:16] (03PS1) 10Ayounsi: Netbox report timers: run as sre_bot user [puppet] - 10https://gerrit.wikimedia.org/r/1055957 [15:05:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 22 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055480 (https://phabricator.wikimedia.org/T370517) (owner: 10Catrope) [15:05:27] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#10002809 (10elukey) Tried another time, but the post-commit hook failed to push to puppetmaster1001, I noticed a TCP SYN stuck... 
[15:06:00] (03CR) 10Andrew Bogott: [C:03+2] deployment-prep: remove Buster parsoid and mwmaint hosts [puppet] - 10https://gerrit.wikimedia.org/r/1055956 (https://phabricator.wikimedia.org/T361386) (owner: 10Southparkfan) [15:07:00] (03CR) 10Btullis: [C:03+1] "Looks good to me." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055417 (https://phabricator.wikimedia.org/T365839) (owner: 10Btullis) [15:07:25] 06SRE, 10Maps, 07affects-Kiwix-and-openZIM: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T370650 (10Audiodude) 03NEW [15:07:52] 06SRE, 10Maps, 07affects-Kiwix-and-openZIM: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T370650#10002850 (10Audiodude) mwoffliner issue: https://github.com/openzim/mwoffliner/issues/2061 [15:08:41] !log dancy@deploy1002 Finished scap: Backport for [[gerrit:1053752|MWMultiVersion.php: Allow MW_FORCE_VERSION to pin the mw version (T369115)]] (duration: 09m 10s) [15:08:45] T369115: [WE6.2.1] Publish pre-train single version containers - https://phabricator.wikimedia.org/T369115 [15:09:45] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:10:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10002864 (10Papaul) @Jclark-ctr ok let me look [16:37:16] !log [doh1001] upgrade anycast-healthchecker to 0.9.8-1+wmf12u1: T370068 [16:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:20] T370068: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) - https://phabricator.wikimedia.org/T370068 [16:39:20] FIRING: [10x] SystemdUnitFailed: 
netbox_report_cables_run.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:39:56] (03PS2) 10David Caro: toolforge: remove pinning from the services node [puppet] - 10https://gerrit.wikimedia.org/r/1055976 (https://phabricator.wikimedia.org/T311914) [16:44:20] FIRING: [10x] SystemdUnitFailed: netbox_report_cables_run.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:44:39] (03CR) 10Vgutierrez: [C:03+1] varnish: show better error for 429s [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) (owner: 10CDobbins) [16:49:31] (03CR) 10Dzahn: [V:04-1] "https://puppet-compiler.wmflabs.org/output/1055488/3351/doc1003.eqiad.wmnet/change.doc1003.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/1055488 (owner: 10Dzahn) [16:52:34] (03PS7) 10BCornwall: ncredir: Reformat/sort the redirects file [puppet] - 10https://gerrit.wikimedia.org/r/1025875 (https://phabricator.wikimedia.org/T355189) [16:53:46] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3352/co" [puppet] - 10https://gerrit.wikimedia.org/r/1025875 (https://phabricator.wikimedia.org/T355189) (owner: 10BCornwall) [16:54:16] (03Abandoned) 10BCornwall: acme-chief: Add new certificates and domains [puppet] - 10https://gerrit.wikimedia.org/r/1047147 (owner: 10BCornwall) [16:54:19] (03Abandoned) 10BCornwall: acme-chief: Add new certificates and domains [puppet] - 10https://gerrit.wikimedia.org/r/1047150 (owner: 10BCornwall) [16:54:20] FIRING: [11x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on 
netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:54:22] (03Abandoned) 10BCornwall: acme-chief: Add new certificates and domains [puppet] - 10https://gerrit.wikimedia.org/r/1047149 (owner: 10BCornwall) [16:54:25] (03Abandoned) 10BCornwall: acme-chief: Add new certificates and domains [puppet] - 10https://gerrit.wikimedia.org/r/1047148 (owner: 10BCornwall) [16:55:37] FIRING: [11x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240722T1700) [17:00:04] ryankemper: How many deployers does it take to do Wikidata Query Service weekly deploy deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240722T1700). 
[17:00:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 23 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055653 (https://phabricator.wikimedia.org/T370387) (owner: 10KartikMistry) [17:02:29] (03CR) 10JHathaway: [C:04-1] profile::puppetmaster::frontend: allow puppetservers via ssh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1055964 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [17:04:20] FIRING: [10x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:06:19] 10ops-codfw, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678 (10RobH) 03NEW [17:07:08] 10ops-codfw, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10003764 (10RobH) [17:08:22] (03CR) 10David Caro: [C:03+2] "Tested in tools" [puppet] - 10https://gerrit.wikimedia.org/r/1055976 (https://phabricator.wikimedia.org/T311914) (owner: 10David Caro) [17:08:51] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on netbox1003.eqiad.wmnet with reason: netbox upgrade prep work [17:08:56] !log ayounsi@cumin1002 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 2:00:00 on netbox1003.eqiad.wmnet with reason: netbox upgrade prep work [17:09:06] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on netbox1003.eqiad.wmnet with reason: netbox upgrade prep work [17:09:09] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on netbox1003.eqiad.wmnet with reason: netbox upgrade prep work 
[17:09:18] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on netbox2003.codfw.wmnet with reason: netbox upgrade prep work [17:09:20] FIRING: [10x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:09:32] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on netbox2003.codfw.wmnet with reason: netbox upgrade prep work [17:11:10] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephmon1004.eqiad.wmnet with OS bullseye [17:11:19] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10003777 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye [17:14:12] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10003815 (10cmooney) >>! In T364870#10001348, @ayounsi wrote: > There is an outstanding diff on the switch for `cloudcephmon1006`. It looks correct, but c... [17:19:30] FIRING: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [17:23:01] (03CR) 10Jforrester: "Must wait for wmf.16 (and the i18n being loaded)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055484 (https://phabricator.wikimedia.org/T355267) (owner: 10Ebernhardson) [17:23:05] (03CR) 10Jforrester: [C:04-1] beta: Enable NetworkSession extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055484 (https://phabricator.wikimedia.org/T355267) (owner: 10Ebernhardson) [17:29:51] (03CR) 10Dzahn: [V:04-1] "https://puppet-compiler.wmflabs.org/output/1055489/3353/aphlict1002.eqiad.wmnet/change.aphlict1002.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/1055489 (owner: 10Dzahn) [17:32:58] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephmon1004.eqiad.wmnet with OS bullseye [17:33:05] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10003913 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye e... [17:33:20] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [17:34:15] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10003910 (10cmooney) @Jclark-ctr I fixed up some issues in Netbox for cloudcephmon1006 (was on the wrong primary vlan, and had an IP from a cloud-private-... 
[17:39:28] (03CR) 10Dzahn: [V:03+1] "https://puppet-compiler.wmflabs.org/output/1055492/3354/lists1004.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1055492 (owner: 10Dzahn) [17:41:31] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for new cloudceph nodes - cmooney@cumin1002" [17:42:26] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for new cloudceph nodes - cmooney@cumin1002" [17:42:26] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:53:08] (03PS1) 10Catrope: Add Chart extension, enable in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055984 (https://phabricator.wikimedia.org/T369945) [17:53:09] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Update codfw LVS connectivity to support new LSW in rows C & D - https://phabricator.wikimedia.org/T370635#10004023 (10cmooney) [17:53:11] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10004024 (10cmooney) [17:53:23] (03CR) 10Catrope: [C:04-2] "Not before July 30th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055984 (https://phabricator.wikimedia.org/T369945) (owner: 10Catrope) [17:53:40] (03CR) 10Dzahn: [V:04-1] "https://puppet-compiler.wmflabs.org/output/1055495/3355/vrts1001.eqiad.wmnet/change.vrts1001.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/1055495 (owner: 10Dzahn) [17:59:57] (03CR) 10Dzahn: [V:04-1] "https://puppet-compiler.wmflabs.org/output/1055494/3356/releases1003.eqiad.wmnet/change.releases1003.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/1055494 (owner: 10Dzahn) [18:00:02] (03PS2) 10Dzahn: releases: switch 
firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055494 [18:01:12] (03CR) 10AOkoth: [C:03+2] vrts: use curl with -x flag for proxy [cookbooks] - 10https://gerrit.wikimedia.org/r/1055297 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth) [18:01:23] (03CR) 10Dzahn: [V:04-1] releases: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055494 (owner: 10Dzahn) [18:01:48] (03PS2) 10Dzahn: phabricator: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055493 [18:04:13] (03CR) 10Dzahn: [V:04-1] "https://puppet-compiler.wmflabs.org/output/1055493/3358/phab1004.eqiad.wmnet/change.phab1004.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/1055493 (owner: 10Dzahn) [18:05:22] (03Merged) 10jenkins-bot: vrts: use curl with -x flag for proxy [cookbooks] - 10https://gerrit.wikimedia.org/r/1055297 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth) [18:05:30] (03PS5) 10Ebernhardson: Produce a limited set of event streams on private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046) [18:05:30] (03CR) 10Ebernhardson: Produce a limited set of event streams on private wikis (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046) (owner: 10Ebernhardson) [18:10:25] (03CR) 10Dzahn: [V:04-1] "https://puppet-compiler.wmflabs.org/output/1055491/3360/contint1002.wikimedia.org/change.contint1002.wikimedia.org.err" [puppet] - 10https://gerrit.wikimedia.org/r/1055491 (owner: 10Dzahn) [18:10:33] (03CR) 10Ssingh: "Can you add the PCC output here (which I am assuming you ran)." 
[puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) (owner: 10CDobbins) [18:12:00] !log aokoth@cumin1002 START - Cookbook sre.vrts.upgrade on VRTS host vrts2001.codfw.wmnet [18:12:10] !log aokoth@cumin1002 END (ERROR) - Cookbook sre.vrts.upgrade (exit_code=97) on VRTS host vrts2001.codfw.wmnet [18:12:22] !log aokoth@cumin1002 START - Cookbook sre.vrts.upgrade on VRTS host vrts2001.codfw.wmnet [18:13:06] !log aokoth@cumin1002 END (FAIL) - Cookbook sre.vrts.upgrade (exit_code=99) on VRTS host vrts2001.codfw.wmnet [18:14:26] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3361/co" [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) (owner: 10CDobbins) [18:15:27] (03CR) 10CDobbins: "Started by upstream project "operations-puppet-catalog-compiler" [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) (owner: 10CDobbins) [18:17:41] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3362/co" [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) (owner: 10CDobbins) [18:18:17] (03PS1) 10AOkoth: vrts: fix extract [cookbooks] - 10https://gerrit.wikimedia.org/r/1055988 (https://phabricator.wikimedia.org/T366078) [18:20:47] (03CR) 10Dzahn: "@Jelto what's weird is this (nowadays) fails in the same way for almost every other service.. except.. also not for all. 
it works fine on " [puppet] - 10https://gerrit.wikimedia.org/r/1055490 (owner: 10Dzahn) [18:24:38] (03PS2) 10AOkoth: vrts: fix extract [cookbooks] - 10https://gerrit.wikimedia.org/r/1055988 (https://phabricator.wikimedia.org/T366078) [18:27:41] !log aokoth@cumin1002 START - Cookbook sre.vrts.upgrade on VRTS host vrts2001.codfw.wmnet [18:27:55] !log aokoth@cumin1002 END (FAIL) - Cookbook sre.vrts.upgrade (exit_code=99) on VRTS host vrts2001.codfw.wmnet [18:38:39] (03CR) 10Ssingh: [C:03+1] "Looks good, nice work! Let's merge it tomorrow by disabling Puppet on A:cp and then testing it out on one host before rolling to all other" [puppet] - 10https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) (owner: 10CDobbins) [18:40:23] (03CR) 10AOkoth: "Tested with https://wikitech.wikimedia.org/wiki/Spicerack/Cookbooks#Test_before_merging against vrts2001. Works as expected." [cookbooks] - 10https://gerrit.wikimedia.org/r/1055988 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth) [18:40:30] (03CR) 10AOkoth: [C:03+2] vrts: fix extract [cookbooks] - 10https://gerrit.wikimedia.org/r/1055988 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth) [18:41:19] (03PS1) 10Michael Große: HACK: add option to checked-disable checkboxes [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1055995 (https://phabricator.wikimedia.org/T370611) [18:41:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 22 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1055995 (https://phabricator.wikimedia.org/T370611) (owner: 10Michael Große) [18:43:14] (03PS1) 10Dzahn: planet: replace ferm::service with firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1055996 (https://phabricator.wikimedia.org/T370677) [18:43:15] (03PS1) 10Michael Große: HACK: show 
structured link task as disabled if frontend flag is true [extensions/GrowthExperiments] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1055997 (https://phabricator.wikimedia.org/T370611) [18:43:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 22 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1055997 (https://phabricator.wikimedia.org/T370611) (owner: 10Michael Große) [18:44:15] (03Merged) 10jenkins-bot: vrts: fix extract [cookbooks] - 10https://gerrit.wikimedia.org/r/1055988 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth) [18:45:13] (03CR) 10Ssingh: "let's rebase this and merge." [puppet] - 10https://gerrit.wikimedia.org/r/1014589 (owner: 10CDobbins) [18:45:40] (03CR) 10Ssingh: "@aotto@wikimedia.org: Hi! I guess we can abandon this?" [puppet] - 10https://gerrit.wikimedia.org/r/1050017 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [18:46:29] (03CR) 10Dzahn: [C:03+2] "https://puppet-compiler.wmflabs.org/output/1055996/3364/planet1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1055996 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [18:48:16] (03PS1) 10Bartosz Dziewoński: sshkey_list: Fix stray quote [software/bitu] - 10https://gerrit.wikimedia.org/r/1055998 [18:49:02] (03CR) 10BCornwall: "oh? What makes you think so?" 
[alerts] - 10https://gerrit.wikimedia.org/r/1055498 (https://phabricator.wikimedia.org/T362833) (owner: 10BCornwall) [18:49:34] (03PS2) 10Dzahn: planet: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055490 [18:51:16] (03PS1) 10Scott French: deployment_server: install the cache warmup script [puppet] - 10https://gerrit.wikimedia.org/r/1055999 (https://phabricator.wikimedia.org/T369921) [18:51:55] (03CR) 10CI reject: [V:04-1] deployment_server: install the cache warmup script [puppet] - 10https://gerrit.wikimedia.org/r/1055999 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [18:54:17] (03CR) 10Dzahn: "needed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1055996 first .. now on to a different issue:" [puppet] - 10https://gerrit.wikimedia.org/r/1055490 (owner: 10Dzahn) [18:54:24] (03PS1) 10Ssingh: hiera: dns6001: reduce anycast_hc logging level and backups [puppet] - 10https://gerrit.wikimedia.org/r/1056000 (https://phabricator.wikimedia.org/T370068) [18:54:46] (03PS1) 10Scott French: mediawiki: fetch active deployment host [software/spicerack] - 10https://gerrit.wikimedia.org/r/1056001 (https://phabricator.wikimedia.org/T369921) [18:54:55] (03PS65) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [18:55:27] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1056000 (https://phabricator.wikimedia.org/T370068) (owner: 10Ssingh) [18:58:22] (03CR) 10Bking: dse-k8s-services: Add net-new chart for Airflow (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [18:59:59] (03PS3) 10Dzahn: planet: switch firewall provider to nftables [puppet] - 
10https://gerrit.wikimedia.org/r/1055490 [19:00:15] (03PS1) 10Bartosz Dziewoński: Fix incomplete table.vertical styles causing broken layout [software/bitu] - 10https://gerrit.wikimedia.org/r/1056002 [19:02:42] (03PS2) 10Scott French: deployment_server: install the cache warmup script [puppet] - 10https://gerrit.wikimedia.org/r/1055999 (https://phabricator.wikimedia.org/T369921) [19:03:48] (03PS4) 10Dzahn: planet: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055490 [19:05:21] (03PS1) 10Bartosz Dziewoński: SpecialMovePage: fix logic to check `delete-redirect` [core] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1056003 (https://phabricator.wikimedia.org/T370669) [19:05:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 22 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-" [core] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1056003 (https://phabricator.wikimedia.org/T370669) (owner: 10Bartosz Dziewoński) [19:06:59] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1055999 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [19:07:12] (03CR) 10Ottomata: "One more nit! Thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046) (owner: 10Ebernhardson) [19:08:56] (03CR) 10Ottomata: "Hi! No, we will come back to it soon. 
We can try again next week:" [puppet] - 10https://gerrit.wikimedia.org/r/1050017 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [19:09:21] (03CR) 10Ssingh: "Sure happy to, just ping us :)" [puppet] - 10https://gerrit.wikimedia.org/r/1050017 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [19:09:52] (03CR) 10Ssingh: [C:03+1] "https://prometheus.io/docs/prometheus/latest/querying/functions/#increase" [alerts] - 10https://gerrit.wikimedia.org/r/1055498 (https://phabricator.wikimedia.org/T362833) (owner: 10BCornwall) [19:10:41] (03CR) 10Andrea Denisse: [V:03+1 C:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3368/co" [puppet] - 10https://gerrit.wikimedia.org/r/1054892 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [19:13:35] (03PS3) 10Scott French: deployment_server: install the cache warmup script [puppet] - 10https://gerrit.wikimedia.org/r/1055999 (https://phabricator.wikimedia.org/T369921) [19:14:15] (03CR) 10Andrea Denisse: [V:03+1 C:03+2] grafana: clone grafana-grizzly with default parameters [puppet] - 10https://gerrit.wikimedia.org/r/1054892 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [19:14:56] (03PS5) 10Dzahn: planet: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055490 [19:16:57] (03CR) 10CI reject: [V:04-1] HACK: show structured link task as disabled if frontend flag is true [extensions/GrowthExperiments] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1055997 (https://phabricator.wikimedia.org/T370611) (owner: 10Michael Große) [19:17:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:18:21] (03CR) 10Scott French: "check 
experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1055999 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [19:21:50] (03CR) 10Dzahn: "Do you want it to run on all deployment servers, including the one currently not active and deployment servers in cloud? Or really just on" [puppet] - 10https://gerrit.wikimedia.org/r/1055999 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [19:25:15] (03CR) 10Michael Große: "recheck" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1055997 (https://phabricator.wikimedia.org/T370611) (owner: 10Michael Große) [19:28:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 12.04% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:31:05] (03PS1) 10Ahmon Dancy: MWMultiVersion.php: Use FORCE_MW_VERSION instead of MW_FORCE_VERSION [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056004 (https://phabricator.wikimedia.org/T369115) [19:31:57] (03PS2) 10Ahmon Dancy: MWMultiVersion.php: Use FORCE_MW_VERSION instead of MW_FORCE_VERSION [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056004 (https://phabricator.wikimedia.org/T369115) [19:33:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056004 (https://phabricator.wikimedia.org/T369115) (owner: 10Ahmon Dancy) [19:33:46] (03Merged) 10jenkins-bot: MWMultiVersion.php: Use FORCE_MW_VERSION instead of MW_FORCE_VERSION [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056004 (https://phabricator.wikimedia.org/T369115) (owner: 10Ahmon Dancy) [19:34:02] !log dancy@deploy1002 Started scap sync-world: Backport for [[gerrit:1056004|MWMultiVersion.php: Use 
FORCE_MW_VERSION instead of MW_FORCE_VERSION (T369115)]] [19:34:06] T369115: [WE6.2.1] Publish pre-train single version containers - https://phabricator.wikimedia.org/T369115 [19:36:53] (03PS1) 10Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1056005 [19:37:25] (03CR) 10Ahmon Dancy: [C:03+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1056005 (owner: 10Ahmon Dancy) [19:37:30] (03CR) 10Ahmon Dancy: [V:03+2 C:03+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1056005 (owner: 10Ahmon Dancy) [19:38:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 21.44% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:47:49] !log dancy@deploy1002 dancy: Backport for [[gerrit:1056004|MWMultiVersion.php: Use FORCE_MW_VERSION instead of MW_FORCE_VERSION (T369115)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:47:53] T369115: [WE6.2.1] Publish pre-train single version containers - https://phabricator.wikimedia.org/T369115 [19:47:57] !log dancy@deploy1002 dancy: Continuing with sync [19:54:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 22.84% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:54:24] !log dancy@deploy1002 Finished scap: Backport for [[gerrit:1056004|MWMultiVersion.php: Use 
FORCE_MW_VERSION instead of MW_FORCE_VERSION (T369115)]] (duration: 20m 22s) [19:54:28] T369115: [WE6.2.1] Publish pre-train single version containers - https://phabricator.wikimedia.org/T369115 [19:55:11] (03PS3) 10Dzahn: releases: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055494 (https://phabricator.wikimedia.org/T370677) [19:55:25] (03PS6) 10Dzahn: planet: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055490 (https://phabricator.wikimedia.org/T370677) [19:55:34] (03PS2) 10Dzahn: lists: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055492 (https://phabricator.wikimedia.org/T370677) [19:55:43] (03PS3) 10Dzahn: phabricator: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055493 (https://phabricator.wikimedia.org/T370677) [19:56:00] (03PS2) 10Dzahn: ci: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055491 (https://phabricator.wikimedia.org/T370677) [19:56:10] (03PS2) 10Dzahn: aphlict: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055489 (https://phabricator.wikimedia.org/T370677) [19:56:15] (03PS1) 10Dzahn: phabricator: replace ferm::service with firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1056006 (https://phabricator.wikimedia.org/T370677) [19:56:42] (03PS2) 10Dzahn: etherpad: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055496 (https://phabricator.wikimedia.org/T370677) [19:56:51] (03PS2) 10Dzahn: vrts: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055495 (https://phabricator.wikimedia.org/T370677) [19:59:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 22.65% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:59:20] (03PS3) 10Dzahn: doc: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055488 (https://phabricator.wikimedia.org/T370677) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240722T2000). nyaa~ [20:00:04] Sohom_Datta, RoanKattouw, MichaelG_WMF, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:06] * MichaelG_WMF is here [20:00:17] hi [20:00:23] o/ [20:01:32] (03CR) 10Catrope: [C:03+2] HACK: add option to checked-disable checkboxes [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1055995 (https://phabricator.wikimedia.org/T370611) (owner: 10Michael Große) [20:01:55] (03CR) 10Catrope: [C:03+2] HACK: show structured link task as disabled if frontend flag is true [extensions/GrowthExperiments] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1055997 (https://phabricator.wikimedia.org/T370611) (owner: 10Michael Große) [20:02:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055480 (https://phabricator.wikimedia.org/T370517) (owner: 10Catrope) [20:02:26] Hi everyone, I'll do the deployment today [20:02:43] Thank you 🙏 [20:03:16] (03PS4) 10Catrope: Work around T370517 by remapping the affected i18n message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055480 (https://phabricator.wikimedia.org/T370517) [20:03:20] (03CR) 10TrainBranchBot: "Approved by 
catrope@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055480 (https://phabricator.wikimedia.org/T370517) (owner: 10Catrope) [20:03:56] Sounds good :) [20:03:56] (03PS1) 10Andrew Bogott: cloud-vps puppetservers: remove use of the 'gitpuppet' user [puppet] - 10https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) [20:04:03] (03Merged) 10jenkins-bot: Work around T370517 by remapping the affected i18n message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055480 (https://phabricator.wikimedia.org/T370517) (owner: 10Catrope) [20:04:21] !log catrope@deploy1002 Started scap sync-world: Backport for [[gerrit:1055480|Work around T370517 by remapping the affected i18n message (T370517)]] [20:04:25] T370517: Search button message text changes - https://phabricator.wikimedia.org/T370517 [20:06:33] (03CR) 10Dzahn: [C:04-1] "issues from:" [puppet] - 10https://gerrit.wikimedia.org/r/1055488 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [20:06:37] !log catrope@deploy1002 catrope: Backport for [[gerrit:1055480|Work around T370517 by remapping the affected i18n message (T370517)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:07:44] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1056010 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [20:07:48] This isn't really testable because the message blob rebuild script needs to run, so I'm going to proceed [20:07:50] !log catrope@deploy1002 catrope: Continuing with sync [20:07:58] (03PS7) 10Dzahn: planet: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055490 (https://phabricator.wikimedia.org/T370677) [20:12:46] !log catrope@deploy1002 Finished scap: Backport for [[gerrit:1055480|Work around T370517 by remapping the affected i18n message (T370517)]] (duration: 08m 24s) [20:12:50] T370517: Search button message text changes - 
https://phabricator.wikimedia.org/T370517 [20:15:49] (03PS1) 10Dreamy Jazz: Define wgGlobalBlockingCentralWiki as 'metawiki' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056014 (https://phabricator.wikimedia.org/T370457) [20:16:17] (03PS2) 10Dreamy Jazz: Define wgGlobalBlockingCentralWiki as 'metawiki' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1056014 (https://phabricator.wikimedia.org/T370457) [20:16:28] (03CR) 10Scott French: "Thanks, @dzahn@wikimedia.org! Indeed, I just happened to start with two example hosts that I know to work and be in active use. Agreed tha" [puppet] - 10https://gerrit.wikimedia.org/r/1055999 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [20:16:39] (03CR) 10Catrope: [C:03+2] SpecialMovePage: fix logic to check `delete-redirect` [core] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1056003 (https://phabricator.wikimedia.org/T370669) (owner: 10Bartosz Dziewoński) [20:21:17] (03PS8) 10Dzahn: planet: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1055490 (https://phabricator.wikimedia.org/T370677) [20:21:51] (03CR) 10Catrope: [C:03+2] Do not unreview pages when they are moved [extensions/PageTriage] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1055941 (https://phabricator.wikimedia.org/T370593) (owner: 10Sohom Datta) [20:22:15] My patch is working -- for all the others we're waiting on CI [20:22:39] 6 more minutes [20:23:02] RoanKattouw: My changes are only testable when deployed together [20:23:19] OK good to know. 
I was planning to deploy them together anyway [20:23:28] 👍 [20:26:45] (03CR) 10Dzahn: "I meant more like "do not run it unless it's the active production deployment server" since the "ensure => present" is hardcoded in module" [puppet] - 10https://gerrit.wikimedia.org/r/1055999 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [20:28:42] (03Merged) 10jenkins-bot: HACK: add option to checked-disable checkboxes [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1055995 (https://phabricator.wikimedia.org/T370611) (owner: 10Michael Große) [20:29:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 21.52% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:34:59] Where is that time estimate for those CI pipelines coming from anyway? 
For some it feels pretty accurate, for others it can be quite off [20:39:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by catrope@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1055997 (https://phabricator.wikimedia.org/T370611) (owner: 10Michael Große) [20:39:33] (03PS6) 10Ebernhardson: Produce a limited set of event streams on private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046) [20:39:33] (03CR) 10Ebernhardson: Produce a limited set of event streams on private wikis (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046) (owner: 10Ebernhardson) [20:40:03] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [20:40:28] (03Merged) 10jenkins-bot: HACK: show structured link task as disabled if frontend flag is true [extensions/GrowthExperiments] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1055997 (https://phabricator.wikimedia.org/T370611) (owner: 10Michael Große) [20:40:45] !log catrope@deploy1002 Started scap sync-world: Backport for [[gerrit:1055995|HACK: add option to checked-disable checkboxes (T370611)]], [[gerrit:1055997|HACK: show structured link task as disabled if frontend flag is true (T370611)]] [20:40:50] T370611: CommunityConfiguration: `Add a link (Structured task)` fix handling when "backend" is enabled & "frontend" is disabled - https://phabricator.wikimedia.org/T370611 [20:43:00] !log catrope@deploy1002 catrope, migr: Backport for [[gerrit:1055995|HACK: add option to checked-disable checkboxes (T370611)]], [[gerrit:1055997|HACK: show structured link task as disabled if frontend flag is true (T370611)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:43:04] @RoanKattouw it works as expected. Thank you! [20:43:12] You confirmed that in 4 seconds?! 
[20:43:33] it was actually available before you posted the message [20:43:44] (I could not type that text in 4 seconds) [20:44:20] Ah I see, so you were just refreshing the test servers constantly :) [20:44:22] !log catrope@deploy1002 catrope, migr: Continuing with sync [20:44:28] confirmed showing the checkbox disabled on https://en.wikipedia.org/wiki/Special:CommunityConfiguration/GrowthSuggestedEdits and not disabled on https://es.wikipedia.org/wiki/Especial:Configuraci%C3%B3n_comunitaria/GrowthSuggestedEdits [20:45:02] not constantly, but ... _optimistically_ ^^ [20:45:51] (03Merged) 10jenkins-bot: SpecialMovePage: fix logic to check `delete-redirect` [core] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1056003 (https://phabricator.wikimedia.org/T370669) (owner: 10Bartosz Dziewoński) [20:46:10] my maintenance script is independent from the patch, btw [20:46:10] !log applying additional address to pfw3-codfw reth0.2140 to provide space for new hosts (T370164) [20:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:19] T370164: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164 [20:47:58] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for new pfw3-codfw mgmt IP - cmooney@cumin1002" [20:49:13] !log catrope@deploy1002 Finished scap: Backport for [[gerrit:1055995|HACK: add option to checked-disable checkboxes (T370611)]], [[gerrit:1055997|HACK: show structured link task as disabled if frontend flag is true (T370611)]] (duration: 08m 27s) [20:49:17] T370611: CommunityConfiguration: `Add a link (Structured task)` fix handling when "backend" is enabled & "frontend" is disabled - https://phabricator.wikimedia.org/T370611 [20:49:44] MatmaRex: Yours is next [20:50:06] !log catrope@deploy1002 Started scap sync-world: Backport for [[gerrit:1056003|SpecialMovePage: fix logic to 
check `delete-redirect` (T370669)]] [20:50:12] T370669: Impossible to move page A to B if B is redirected to C, even with delete rights - https://phabricator.wikimedia.org/T370669 [20:50:18] (03Merged) 10jenkins-bot: Do not unreview pages when they are moved [extensions/PageTriage] (wmf/1.43.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1055941 (https://phabricator.wikimedia.org/T370593) (owner: 10Sohom Datta) [20:52:39] !log catrope@deploy1002 catrope, matmarex: Backport for [[gerrit:1056003|SpecialMovePage: fix logic to check `delete-redirect` (T370669)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:55:13] i'm a little confused because i was testing, and i just failed to reproduce the bug on non-testservers [20:55:40] The code definitely looked wrong though! [20:56:10] yeah it's definitely wrong. i must have some different user rights than the bug reporter [20:56:17] or maybe the config on testwiki is different [20:56:18] Maybe you have both the delete-redirect and delete right? It looks like the bug is specific to having delete but not delete-redirect [20:57:10] (https://test.wikipedia.org/wiki/Special:Log that move and delete should not have succeeded) [20:58:15] (03PS1) 10Cathal Mooney: Add include in 10/8 reverse zone for new frack codfw mgmt range [dns] - 10https://gerrit.wikimedia.org/r/1056017 (https://phabricator.wikimedia.org/T370164) [20:58:26] the delete-redirect right doesn't exist on most wikis [20:58:55] (and it seems i don't have it) [21:00:05] Reedy, sbassett, Maryum, and manfredi: Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240722T2100). Please do the needful. [21:00:38] The backport window isn't done yet, sorry [21:01:04] MatmaRex: Well I guess if nothing appears to be broken with the patch either, I suppose we could just sync it and ask the reporters on jawiki whether it improved things for them? 
[21:01:23] i still don't understand how that happened [21:01:28] RoanKattouw: i suppose. the old code is definitely wrong [21:01:57] oh, i see [21:02:17] it was because the page i made was a redirect, but not a *single-rev* redirect [21:03:31] and now i can reproduce the error [21:03:51] and i can also verify the fix [21:04:02] RoanKattouw: please proceed :) sorry about that, thanks for waiting [21:04:32] !log catrope@deploy1002 catrope, matmarex: Continuing with sync [21:05:01] (CR) Dzahn: [C:-1] "@Jelto the issue we see on many hosts with few exceptions comes from including profile::tlsproxy::envoy. So it will need a patch in there " [puppet] - https://gerrit.wikimedia.org/r/1055490 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn) [21:05:15] (PS4) Scott French: deployment_server: install the cache warmup script [puppet] - https://gerrit.wikimedia.org/r/1055999 (https://phabricator.wikimedia.org/T369921) [21:06:25] (CR) Scott French: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1055999 (https://phabricator.wikimedia.org/T369921) (owner: Scott French) [21:09:14] (CR) Dzahn: "my previous comments can be safely ignored since this doesn't come with a timer or anything running the script, just installs a file..
so " [puppet] - https://gerrit.wikimedia.org/r/1055999 (https://phabricator.wikimedia.org/T369921) (owner: Scott French) [21:09:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 24.8% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:09:19] !log catrope@deploy1002 Finished scap: Backport for [[gerrit:1056003|SpecialMovePage: fix logic to check `delete-redirect` (T370669)]] (duration: 19m 12s) [21:09:23] T370669: Impossible to move page A to B if B is redirected to C, even with delete rights - https://phabricator.wikimedia.org/T370669 [21:10:06] !log catrope@deploy1002 Started scap sync-world: Backport for [[gerrit:1055941|Do not unreview pages when they are moved (T370593)]] [21:10:09] Sohom_Datta: Doing your patch now, sorry for the delay [21:10:11] T370593: Moving a page always marks it as unreviewed - https://phabricator.wikimedia.org/T370593 [21:10:32] (CR) Scott French: "Got it, thanks for the follow-up.
Summarizing discussion out of band, there is no timer resource created by this (just a script and associ" [puppet] - https://gerrit.wikimedia.org/r/1055999 (https://phabricator.wikimedia.org/T369921) (owner: Scott French) [21:10:44] No issues :) [21:12:32] !log catrope@deploy1002 catrope, soda: Backport for [[gerrit:1055941|Do not unreview pages when they are moved (T370593)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:14:45] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 21.9% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:17:01] (PS9) Dzahn: planet: switch firewall provider to nftables [puppet] - https://gerrit.wikimedia.org/r/1055490 (https://phabricator.wikimedia.org/T370677) [21:18:45] (CR) Ssingh: [C:+1] Add include in 10/8 reverse zone for new frack codfw mgmt range [dns] - https://gerrit.wikimedia.org/r/1056017 (https://phabricator.wikimedia.org/T370164) (owner: Cathal Mooney) [21:19:00] RoanKattouw: Just tested, looks good [21:19:40] (CR) Cathal Mooney: [C:+2] Add include in 10/8 reverse zone for new frack codfw mgmt range [dns] - https://gerrit.wikimedia.org/r/1056017 (https://phabricator.wikimedia.org/T370164) (owner: Cathal Mooney) [21:19:45] FIRING: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [21:22:56] (PS1) Cathal Mooney: Add new mgmt range for frack codfw to network defs [puppet] - https://gerrit.wikimedia.org/r/1056026 (https://phabricator.wikimedia.org/T370164) [21:24:43] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data:
"Triggered by cookbooks.sre.dns.netbox: update dns entries for new pfw3-codfw mgmt IP - cmooney@cumin1002" [21:24:43] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:25:24] !log catrope@deploy1002 catrope, soda: Continuing with sync [21:25:43] (PS5) Scott French: deployment_server: install the cache warmup script [puppet] - https://gerrit.wikimedia.org/r/1055999 (https://phabricator.wikimedia.org/T369921) [21:25:53] (PS10) Dzahn: planet: switch firewall provider to nftables [puppet] - https://gerrit.wikimedia.org/r/1055490 (https://phabricator.wikimedia.org/T370677) [21:26:06] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#10005058 (cmooney) OK I have allocated 10.195.1.0/25 in Netbox and configured 10.195.1.1 as a secondary IP on pfw3-codfw on the mgmt int... [21:26:34] (CR) Scott French: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1055999 (https://phabricator.wikimedia.org/T369921) (owner: Scott French) [21:29:30] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 24.79% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:30:34] !log catrope@deploy1002 Finished scap: Backport for [[gerrit:1055941|Do not unreview pages when they are moved (T370593)]] (duration: 20m 27s) [21:30:39] T370593: Moving a page always marks it as unreviewed - https://phabricator.wikimedia.org/T370593 [21:30:43] Alright, backport window is done [21:31:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 19.55% idle - https://bit.ly/wmf-fpmsat -
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:32:55] (PS11) Dzahn: planet: switch firewall provider to nftables [puppet] - https://gerrit.wikimedia.org/r/1055490 (https://phabricator.wikimedia.org/T370677) [21:33:44] (PS1) Cathal Mooney: Widen netmak for allowed in BGP prefixes codfw frack [homer/public] - https://gerrit.wikimedia.org/r/1056029 (https://phabricator.wikimedia.org/T370164) [21:35:38] (PS1) Scott French: deployment_server: noop edit to test PCC [puppet] - https://gerrit.wikimedia.org/r/1056030 [21:35:54] (CR) Ottomata: [C:+1] "+1 for when things are ready" [mediawiki-config] - https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046) (owner: Ebernhardson) [21:36:00] (CR) Scott French: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1056030 (owner: Scott French) [21:36:05] ops-codfw, SRE, collaboration-services, DC-Ops: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#10005101 (Papaul) @Jhancock.wm is possible to get this on a 10G rack if not it's ok.
Thanks [21:36:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 24.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:37:34] (CR) Dzahn: [V:+1 C:+1] "https://puppet-compiler.wmflabs.org/output/1055490/3376/planet1003.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1055490 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn) [21:38:17] (CR) Ottomata: "Hm, actually, you probably should separate these two files into different deployments." [mediawiki-config] - https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046) (owner: Ebernhardson) [21:38:43] (CR) Ottomata: "(unresolving comment)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1055275 (https://phabricator.wikimedia.org/T346046) (owner: Ebernhardson) [21:42:52] (CR) Dzahn: "this looks good to me.. except what has already been mentioned on the meeting ..
maybe we want to make this a generic class that can be re" [puppet] - https://gerrit.wikimedia.org/r/1055886 (https://phabricator.wikimedia.org/T366882) (owner: Jelto) [21:44:03] (PS1) Cathal Mooney: Add monitoring definitions for new codfw row C/D switches [puppet] - https://gerrit.wikimedia.org/r/1056031 (https://phabricator.wikimedia.org/T369106) [21:51:35] (PS6) Scott French: mediawiki-cache-warmup: support 'clone' for mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/1054968 (https://phabricator.wikimedia.org/T369921) [21:52:31] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephmon1004.eqiad.wmnet with OS bullseye [21:52:45] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10005144 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye [22:00:39] (PS3) Dzahn: etherpad: switch firewall provider to nftables [puppet] - https://gerrit.wikimedia.org/r/1055496 (https://phabricator.wikimedia.org/T370677) [22:01:41] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephmon1005.eqiad.wmnet with OS bullseye [22:01:46] (CR) Scott French: "After a bit of thought, I realized that continuing down the path of [0] was not much less effort than just making clone work on kubernetes" [puppet] - https://gerrit.wikimedia.org/r/1054968 (https://phabricator.wikimedia.org/T369921) (owner: Scott French) [22:01:55] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10005163 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephmon1005.eqiad.wmnet with OS bullseye [22:04:33] (PS4) Dzahn: etherpad: switch firewall provider to nftables [puppet] -
https://gerrit.wikimedia.org/r/1055496 (https://phabricator.wikimedia.org/T370677) [22:05:36] (CR) Scott French: switchdc: prepare mediawiki cache warmup for bare-metal turndown (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1053823 (https://phabricator.wikimedia.org/T367949) (owner: Scott French) [22:06:17] (Abandoned) Scott French: switchdc: prepare mediawiki cache warmup for bare-metal turndown [cookbooks] - https://gerrit.wikimedia.org/r/1053823 (https://phabricator.wikimedia.org/T367949) (owner: Scott French) [22:07:44] (CR) Dzahn: [V:+1] "https://puppet-compiler.wmflabs.org/output/1055496/3378/" [puppet] - https://gerrit.wikimedia.org/r/1055496 (https://phabricator.wikimedia.org/T370677) (owner: Dzahn) [22:34:52] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[1100-1102].eqiad.wmnet with reason: T348977 [22:34:56] T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977 [22:35:10] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[1100-1102].eqiad.wmnet with reason: T348977 [22:36:53] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic110[0-2]* for T348977 - bking@cumin2002 [22:36:56] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic110[0-2]* for T348977 - bking@cumin2002 [22:38:37] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephmon1004.eqiad.wmnet with OS bullseye [22:38:52] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10005224 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephmon1004.eqiad.wmnet with OS bullseye ex...
[22:39:17] SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#10005223 (bking) `elastic110[0-2]` are banned and ready , as is `wdqs1016`. [22:47:47] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephmon1005.eqiad.wmnet with OS bullseye [22:47:54] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10005229 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephmon1005.eqiad.wmnet with OS bullseye ex... [23:03:56] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [23:05:52] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:07:06] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "set lsw in codfw to active - cmooney@cumin1002" [23:08:28] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "set lsw in codfw to active - cmooney@cumin1002" [23:17:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:26:58] (CR) BCornwall: "Oh, interesting:" [alerts] - https://gerrit.wikimedia.org/r/1055498 (https://phabricator.wikimedia.org/T362833) (owner: BCornwall) [23:31:33] (CR) Pppery: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1055633 (owner: XXBlackburnXx) [23:32:44] (PS3) Pppery: Update nlwiki AbuseFilter config per consensus [mediawiki-config] - https://gerrit.wikimedia.org/r/1055633 (https://phabricator.wikimedia.org/T370605) (owner:
XXBlackburnXx) [23:38:31] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1056048 [23:38:31] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1056048 (owner: TrainBranchBot) [23:39:20] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:45:09] (Abandoned) Arlolra: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1055249 (owner: PipelineBot) [23:57:44] (CR) Eevans: [C:+1] Prepare for more new-style ms-be nodes [puppet] - https://gerrit.wikimedia.org/r/1055254 (https://phabricator.wikimedia.org/T368928) (owner: MVernon)