[00:00:25] ops-codfw, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10065688 (Dwisehaupt) frdb2004 OS install complete. Will clone the DB across tomorrow.
[00:09:04] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1062767 (owner: TrainBranchBot)
[00:15:58] (CR) Scott French: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1062753 (https://phabricator.wikimedia.org/T372507) (owner: Scott French)
[00:19:11] (CR) Scott French: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1062754 (https://phabricator.wikimedia.org/T372507) (owner: Scott French)
[00:27:48] FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:29:41] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:36:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[00:49:19] (PS1) Eevans: aqs1022: provision new host for hardware refresh [puppet] - https://gerrit.wikimedia.org/r/1062772 (https://phabricator.wikimedia.org/T372514)
[00:52:05] ops-eqiad, SRE, Data-Persistence, DC-Ops, Patch-For-Review: Q1:rack/setup/install aqs1022.eqiad.wmnet - 
https://phabricator.wikimedia.org/T372514#10065708 (Eevans)
[00:52:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[01:07:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[01:19:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:29:26] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:21:17] (CR) Scott French: [C:+1] "No worries! And thanks for your patience while it took me a little while to get back to this." 
[software/cfssl-issuer] - https://gerrit.wikimedia.org/r/1060843 (https://phabricator.wikimedia.org/T337928) (owner: JMeybohm)
[02:23:39] !log milimetric@deploy1003 Started deploy [airflow-dags/analytics@02f37cf]: (no justification provided)
[02:24:22] !log milimetric@deploy1003 Finished deploy [airflow-dags/analytics@02f37cf]: (no justification provided) (duration: 00m 43s)
[02:39:25] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:59:25] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:04:26] RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:22:58] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[03:24:24] ops-codfw, SRE, DC-Ops, fundraising-tech-ops: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10065781 (Papaul) @Dwisehaupt you had the DNS information under Description and not under DNS Name see below ` IP Address Family IPv4 VRF Global Tenant Fundraising Tech S... 
[03:26:50] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix mgmt DNS fro fd2004 - pt1979@cumin2002"
[03:26:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix mgmt DNS fro fd2004 - pt1979@cumin2002"
[03:26:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[03:39:51] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372336#10065796 (phaultfinder)
[03:53:48] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:27:48] FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:33:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:36:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[04:49:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 24 hosts with reason: Primary switchover s3 T372393
[04:49:25] T372393: Switchover s3 master (db1223 -> db1189) - https://phabricator.wikimedia.org/T372393
[04:49:30] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'Set db1189 with weight 0 T372393', diff saved to https://phabricator.wikimedia.org/P67323 and previous config saved to /var/cache/conftool/dbconfig/20240815-044929-root.json
[04:49:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s3 T372393
[04:49:58] (CR) Marostegui: [C:+2] mariadb: Promote db1189 to s3 master [puppet] - https://gerrit.wikimedia.org/r/1062381 (https://phabricator.wikimedia.org/T372393) (owner: Gerrit maintenance bot)
[04:55:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1238.eqiad.wmnet with reason: Stop MariaDB on db1238 T371342
[04:55:40] T371342: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342
[04:55:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1238.eqiad.wmnet with reason: Stop MariaDB on db1238 T371342
[04:56:30] (PS1) Marostegui: db1238: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1062780 (https://phabricator.wikimedia.org/T371342)
[04:59:23] (CR) Marostegui: [C:+2] db1238: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1062780 (https://phabricator.wikimedia.org/T371342) (owner: Marostegui)
[05:01:05] ops-eqiad, SRE, Data-Persistence, DBA, and 2 others: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10065828 (Marostegui) @VRiley-WMF you can proceed whenever you want. The host is ready. 
[05:03:56] !log Starting s3 eqiad failover from db1223 to db1189 - T372393
[05:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:59] T372393: Switchover s3 master (db1223 -> db1189) - https://phabricator.wikimedia.org/T372393
[05:04:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s3 eqiad as read-only for maintenance - T372393', diff saved to https://phabricator.wikimedia.org/P67324 and previous config saved to /var/cache/conftool/dbconfig/20240815-050410-root.json
[05:04:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1189 to s3 primary and set section read-write T372393', diff saved to https://phabricator.wikimedia.org/P67325 and previous config saved to /var/cache/conftool/dbconfig/20240815-050428-root.json
[05:04:59] (PS2) Gerrit maintenance bot: wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/1062382 (https://phabricator.wikimedia.org/T372393)
[05:05:15] (CR) Marostegui: [C:+2] wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/1062382 (https://phabricator.wikimedia.org/T372393) (owner: Gerrit maintenance bot)
[05:05:17] (CR) Marostegui: [V:+2 C:+2] wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/1062382 (https://phabricator.wikimedia.org/T372393) (owner: Gerrit maintenance bot)
[05:06:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1223 T372393', diff saved to https://phabricator.wikimedia.org/P67326 and previous config saved to /var/cache/conftool/dbconfig/20240815-050613-root.json
[05:07:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P67327 and previous config saved to /var/cache/conftool/dbconfig/20240815-050701-root.json
[05:08:59] (PS1) Gerrit maintenance bot: mariadb: Promote db1163 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1062782 (https://phabricator.wikimedia.org/T372524) 
[05:09:04] (PS1) Gerrit maintenance bot: wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/1062783 (https://phabricator.wikimedia.org/T372524)
[05:17:25] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:19:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:22:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P67328 and previous config saved to /var/cache/conftool/dbconfig/20240815-052206-root.json
[05:29:59] (PS1) Marostegui: installserver: Do not reimage db2223 [puppet] - https://gerrit.wikimedia.org/r/1062788
[05:33:01] (CR) Marostegui: [C:+2] installserver: Do not reimage db2223 [puppet] - https://gerrit.wikimedia.org/r/1062788 (owner: Marostegui)
[05:37:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P67329 and previous config saved to /var/cache/conftool/dbconfig/20240815-053712-root.json
[05:52:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P67330 and previous config saved to /var/cache/conftool/dbconfig/20240815-055218-root.json
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T0600)
[06:00:05] marostegui, Amir1, and arnaudb: #bothumor Q:How do functions break up? A:They stop calling each other. 
Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T0600).
[06:00:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from eventstreams.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=text&var-origin=eventstreams.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:07:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P67331 and previous config saved to /var/cache/conftool/dbconfig/20240815-060723-root.json
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:15:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from eventstreams.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=eqiad&var-cluster=text&var-origin=eventstreams.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[06:22:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling 
@ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P67332 and previous config saved to /var/cache/conftool/dbconfig/20240815-062229-root.json
[06:31:44] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[06:31:44] Deployment eventstreams-production in eventstreams at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=eventstreams&var-deployment=eventstreams-production - ...
[06:31:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[06:37:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P67333 and previous config saved to /var/cache/conftool/dbconfig/20240815-063734-root.json
[07:00:05] Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:09:57] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2006.codfw.wmnet with OS bullseye
[07:10:03] ops-codfw, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10065976 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2006.codfw.wmnet with OS bullseye
[07:17:28] ops-eqiad, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main10[06-10] - https://phabricator.wikimedia.org/T371422#10066002 (JMeybohm) >>! In T371422#10064622, @VRiley-WMF wrote: > I have allocated these drives added these SSDs to specified servers. Please test it out and let us k... 
[07:31:01] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 10:00:00 on 9 hosts with reason: T364368 non-prod hosts
[07:31:04] T364368: Create separate pybal pools for wdqs graph split (main vs scholarly) - https://phabricator.wikimedia.org/T364368
[07:31:16] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 10:00:00 on 9 hosts with reason: T364368 non-prod hosts
[07:47:10] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2009.codfw.wmnet with OS bullseye
[07:47:18] ops-codfw, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10066031 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2009.codfw.wmnet with OS bullseye executed with error...
[07:49:11] (CR) MVernon: [C:+1] "Hi," [puppet] - https://gerrit.wikimedia.org/r/1062772 (https://phabricator.wikimedia.org/T372514) (owner: Eevans)
[07:50:24] (PS1) David Caro: ceph: add alert when we get no data from the cluster [alerts] - https://gerrit.wikimedia.org/r/1062962 (https://phabricator.wikimedia.org/T372528)
[08:00:05] jeena and jnuche: OwO what's this, a deployment window?? MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T0800). nyaa~
[08:00:24] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2006.codfw.wmnet with OS bullseye
[08:00:36] ops-codfw, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10066038 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2006.codfw.wmnet with OS bullseye executed with error... 
[08:01:39] (PS3) AOkoth: vrts: build & install packages [cookbooks] - https://gerrit.wikimedia.org/r/1062715 (https://phabricator.wikimedia.org/T366078)
[08:04:29] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2006.codfw.wmnet with OS bullseye
[08:04:42] ops-codfw, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10066052 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2006.codfw.wmnet with OS bullseye
[08:04:50] (CR) Kevin Bazira: [C:+1] ml-services: payload logging in revscoring-mp-articlequality in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1062721 (owner: Ilias Sarantopoulos)
[08:16:41] (CR) Stevemunene: [C:+2] dns: provision airflow-test-k8s temp domain [dns] - https://gerrit.wikimedia.org/r/1062048 (https://phabricator.wikimedia.org/T368760) (owner: Stevemunene)
[08:16:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[08:16:44] Deployment eventstreams-production in eventstreams at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s&var-namespace=eventstreams&var-deployment=eventstreams-production - ... 
[08:16:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[08:16:49] (PS5) Stevemunene: dns: provision airflow-test-k8s temp domain [dns] - https://gerrit.wikimedia.org/r/1062048 (https://phabricator.wikimedia.org/T368760)
[08:18:56] (CR) Stevemunene: dns: provision airflow-test-k8s temp domain [dns] - https://gerrit.wikimedia.org/r/1062048 (https://phabricator.wikimedia.org/T368760) (owner: Stevemunene)
[08:36:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[08:45:54] SRE, SRE-Access-Requests: Requesting access to for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T371796#10066125 (Ifeatu_Nnaobi_WMDE) >>! In T371796#10043749, @Dzahn wrote: > @ifeatu_nnaobi_wmde Could you please send an email to [[ https://meta.wikimedia.org/wiki/U...
[08:55:13] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2006.codfw.wmnet with OS bullseye
[08:55:33] ops-codfw, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10066163 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2006.codfw.wmnet with OS bullseye executed with error... 
[08:59:02] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3641/co" [puppet] - https://gerrit.wikimedia.org/r/1062471 (https://phabricator.wikimedia.org/T365259) (owner: Dzahn)
[09:05:04] (CR) Jelto: [V:+1 C:+2] "I have some doubts if there are any untracked packets at all but let's try this and see if Gerrit throttling behaves differently" [puppet] - https://gerrit.wikimedia.org/r/1062471 (https://phabricator.wikimedia.org/T365259) (owner: Dzahn)
[09:08:47] SRE, SRE-Access-Requests: Requesting access to for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T371796#10066183 (eoghan) a:Ifeatu_Nnaobi_WMDE
[09:09:08] SRE, SRE-Access-Requests: Requesting access to for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T371796#10066184 (eoghan) a:Ifeatu_Nnaobi_WMDE→eoghan
[09:11:36] (CR) Jelto: [V:+1 C:+2] "After merging this I'm added to the DENYLIST immediately when visiting my Gerrit dashboard." 
[puppet] - https://gerrit.wikimedia.org/r/1062471 (https://phabricator.wikimedia.org/T365259) (owner: Dzahn)
[09:17:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:19:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:24:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db2152.codfw.wmnet with reason: Maintenance
[09:24:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db2152.codfw.wmnet with reason: Maintenance
[09:25:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T367856)', diff saved to https://phabricator.wikimedia.org/P67334 and previous config saved to /var/cache/conftool/dbconfig/20240815-092502-marostegui.json
[09:25:06] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[09:27:20] (CR) Stevemunene: [C:+2] dns: provision airflow-test-k8s temp domain [dns] - https://gerrit.wikimedia.org/r/1062048 (https://phabricator.wikimedia.org/T368760) (owner: Stevemunene)
[09:27:47] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2152.codfw.wmnet with reason: Schema change
[09:27:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2152.codfw.wmnet with reason: Schema change
[09:35:43] (CR) Stevemunene: [V:+2 C:+2] dns: provision airflow-test-k8s temp domain [dns] - https://gerrit.wikimedia.org/r/1062048 (https://phabricator.wikimedia.org/T368760) 
(owner: Stevemunene)
[09:39:27] (PS1) JMeybohm: preseed: Switch to reuse receipts for kafka-main20(06,09,10) [puppet] - https://gerrit.wikimedia.org/r/1062967 (https://phabricator.wikimedia.org/T371423)
[09:43:11] ops-eqiad, SRE, Data-Persistence, DBA, DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10066264 (VRiley-WMF) Worked with Dell on this. Orginally, they wanted to update firmware before anything else. However, I provided them with TSR reports. They informed me it'...
[09:43:14] (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1062967 (https://phabricator.wikimedia.org/T371423) (owner: JMeybohm)
[09:44:13] ops-eqiad, SRE, Data-Persistence, DBA, DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10066265 (Marostegui) Thanks! I am going to get it online and I will close the task once it is repooled today. If we see something strange we'll reopen Thank you!
[09:45:16] (PS1) Marostegui: Revert "db1238: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1062968
[09:48:17] (CR) JMeybohm: [C:+2] preseed: Switch to reuse receipts for kafka-main20(06,09,10) [puppet] - https://gerrit.wikimedia.org/r/1062967 (https://phabricator.wikimedia.org/T371423) (owner: JMeybohm)
[09:50:58] (CR) Marostegui: [C:+2] Revert "db1238: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1062968 (owner: Marostegui)
[09:55:18] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2006.codfw.wmnet with OS bullseye
[09:55:34] ops-codfw, SRE, DC-Ops, serviceops, Patch-For-Review: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10066344 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2006.codfw.wmnet with OS bu... 
[09:56:53] (PS1) Hnowlan: php: fix bug in min_avail_workers healthz behaviour [docker-images/production-images] - https://gerrit.wikimedia.org/r/1062971 (https://phabricator.wikimedia.org/T372521)
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T1000)
[10:03:12] ops-eqiad, SRE, Data-Persistence, DBA, DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10066386 (VRiley-WMF) Open→Resolved Awesome, I'll go ahead and mark this as resolved for now.
[10:04:07] (CR) David Caro: [C:+2] wmcs: enable mypy on all our modules [puppet] - https://gerrit.wikimedia.org/r/1060800 (owner: David Caro)
[10:04:10] (CR) David Caro: [C:+2] wmcs.db.wikireplicas: add mypy checks and fix issues [puppet] - https://gerrit.wikimedia.org/r/1060794 (owner: David Caro)
[10:06:02] (CR) Kamila Součková: [C:+1] php: fix bug in min_avail_workers healthz behaviour [docker-images/production-images] - https://gerrit.wikimedia.org/r/1062971 (https://phabricator.wikimedia.org/T372521) (owner: Hnowlan)
[10:10:31] (CR) Hnowlan: [V:+2 C:+2] php: fix bug in min_avail_workers healthz behaviour [docker-images/production-images] - https://gerrit.wikimedia.org/r/1062971 (https://phabricator.wikimedia.org/T372521) (owner: Hnowlan)
[10:11:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P67335 and previous config saved to /var/cache/conftool/dbconfig/20240815-101139-root.json
[10:11:54] ops-eqiad, SRE, Data-Persistence, DBA, DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10066428 (Marostegui) Host being automatically repooled. 
[10:15:25] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2006.codfw.wmnet with reason: host reimage
[10:18:12] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2006.codfw.wmnet with reason: host reimage
[10:19:56] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[10:21:37] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[10:22:51] (PS1) Marostegui: control-mariadb-10.6-bookworm: 10.6.19 is out [software] - https://gerrit.wikimedia.org/r/1062975 (https://phabricator.wikimedia.org/T372536)
[10:23:29] (CR) Marostegui: [C:+2] control-mariadb-10.6-bookworm: 10.6.19 is out [software] - https://gerrit.wikimedia.org/r/1062975 (https://phabricator.wikimedia.org/T372536) (owner: Marostegui)
[10:23:57] (Merged) jenkins-bot: control-mariadb-10.6-bookworm: 10.6.19 is out [software] - https://gerrit.wikimedia.org/r/1062975 (https://phabricator.wikimedia.org/T372536) (owner: Marostegui)
[10:26:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P67336 and previous config saved to /var/cache/conftool/dbconfig/20240815-102645-root.json
[10:27:10] !log Install 10.6.19 on pc1014 db1125 pc2014 T372536
[10:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:12] T372536: Compile and package MariaDB 10.6.19 - https://phabricator.wikimedia.org/T372536
[10:27:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on pc2014.codfw.wmnet with reason: Upgrade to 10.6.19
[10:27:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet with reason: Upgrade to 10.6.19
[10:28:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on pc1014.eqiad.wmnet with reason: Upgrade to 10.6.19 
[10:28:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc1014.eqiad.wmnet with reason: Upgrade to 10.6.19
[10:28:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db1125.eqiad.wmnet with reason: Upgrade to 10.6.19
[10:29:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1125.eqiad.wmnet with reason: Upgrade to 10.6.19
[10:36:05] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2006.codfw.wmnet with OS bullseye
[10:37:21] ops-codfw, SRE, DC-Ops, serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10066471 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2006.codfw.wmnet with OS bullseye completed: - kafka-...
[10:41:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P67337 and previous config saved to /var/cache/conftool/dbconfig/20240815-104150-root.json
[10:49:10] SRE, LDAP-Access-Requests: Grant Access to wmf for SaraiSan WMF - https://phabricator.wikimedia.org/T372290#10066506 (eoghan) Open→Resolved Confirmed working!
[10:51:42] (CR) Hnowlan: "Thanks for the review!" 
[deployment-charts] - https://gerrit.wikimedia.org/r/1062055 (https://phabricator.wikimedia.org/T357309) (owner: Hnowlan)
[10:56:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P67338 and previous config saved to /var/cache/conftool/dbconfig/20240815-105656-root.json
[11:00:47] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply
[11:04:31] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[11:12:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P67339 and previous config saved to /var/cache/conftool/dbconfig/20240815-111201-root.json
[11:12:30] (PS1) Hnowlan: (de|uk|ja|he|fi)wiki: enable shellbox-video [mediawiki-config] - https://gerrit.wikimedia.org/r/1062979 (https://phabricator.wikimedia.org/T369048)
[11:24:41] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[11:27:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P67340 and previous config saved to /var/cache/conftool/dbconfig/20240815-112707-root.json
[11:27:27] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[11:30:03] (PS3) Btullis: Add a non-free component to the apt private repo [puppet] - https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203)
[11:30:27] (CR) CI reject: [V:-1] Add a non-free component to the apt private repo [puppet] - https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: Btullis)
[11:42:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P67341 and previous config saved to 
/var/cache/conftool/dbconfig/20240815-114213-root.json [11:44:12] (03PS4) 10Btullis: Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) [11:44:35] (03CR) 10CI reject: [V:04-1] Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [11:48:14] (03PS5) 10Btullis: Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) [11:48:39] (03CR) 10CI reject: [V:04-1] Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T1200) [12:08:11] (03PS6) 10Btullis: Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) [12:09:45] (03CR) 10CI reject: [V:04-1] Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [12:09:57] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2009.codfw.wmnet with OS bullseye [12:10:00] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2010.codfw.wmnet with OS bullseye [12:10:06] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10066693 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2009.codfw.wmnet with OS bullseye [12:10:08] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - 
https://phabricator.wikimedia.org/T371423#10066694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2010.codfw.wmnet with OS bullseye [12:12:36] (03PS7) 10Btullis: Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) [12:14:14] (03CR) 10CI reject: [V:04-1] Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [12:14:48] (03CR) 10Klausman: [C:03+2] knative-serving: Switch components to use Calico Netpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060452 (owner: 10Klausman) [12:15:44] (03Abandoned) 10Klausman: knative-serving: Switch components to use Calico Netpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060452 (owner: 10Klausman) [12:16:13] (03Restored) 10Klausman: knative-serving: Switch components to use Calico Netpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060452 (owner: 10Klausman) [12:16:53] (03CR) 10Klausman: [V:03+2 C:03+2] knative-serving: Switch components to use Calico Netpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060452 (owner: 10Klausman) [12:16:59] (03PS1) 10Seddon: Save the request before starting the automatic vanish job [extensions/CentralAuth] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1062996 (https://phabricator.wikimedia.org/T372006) [12:17:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CentralAuth] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1062996 (https://phabricator.wikimedia.org/T372006) (owner: 10Seddon) [12:18:29] (03PS8) 10Btullis: Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 
(https://phabricator.wikimedia.org/T370203) [12:20:02] (03CR) 10CI reject: [V:04-1] Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [12:20:07] (03Merged) 10jenkins-bot: knative-serving: Switch components to use Calico Netpolicies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060452 (owner: 10Klausman) [12:20:32] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062999 [12:23:29] !log klausman@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [12:25:11] !log klausman@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [12:26:24] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [12:26:55] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [12:28:57] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2010.codfw.wmnet with reason: host reimage [12:29:16] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2009.codfw.wmnet with reason: host reimage [12:30:41] (03PS9) 10Btullis: Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) [12:32:22] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2010.codfw.wmnet with reason: host reimage [12:34:56] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2009.codfw.wmnet with reason: host reimage [12:36:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [12:47:24] (03PS10) 10Btullis: Add a 
non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) [12:47:49] (03CR) 10CI reject: [V:04-1] Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [12:49:47] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2010.codfw.wmnet with OS bullseye [12:49:53] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10066738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2010.codfw.wmnet with OS bullseye completed: - kafka-... [12:50:16] (03PS11) 10Btullis: Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) [12:51:26] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3650/console" [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [12:52:30] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2009.codfw.wmnet with OS bullseye [12:52:41] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10066741 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2009.codfw.wmnet with OS bullseye completed: - kafka-... 
[12:52:52] (03PS12) 10Btullis: Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) [12:54:22] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [12:55:55] (03PS13) 10Btullis: Add a non-free component to the apt private repo [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) [12:57:29] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3652/co" [puppet] - 10https://gerrit.wikimedia.org/r/1062401 (https://phabricator.wikimedia.org/T370203) (owner: 10Btullis) [13:00:04] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T1300). [13:00:05] seddon: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:01:48] I can probably deploy in a few minutes [13:04:01] (if Seddon is around) [13:04:06] I am1 [13:04:07] ! [13:04:25] ok, then I can deploy now [13:05:32] Cool I'll get myself set up [13:05:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1062996 (https://phabricator.wikimedia.org/T372006) (owner: 10Seddon) [13:05:51] CentralAuth doesn’t have any weird special train / deployment branch stuff, right? 
[13:05:54] that’s CentralNotice I’m thinking of [13:08:01] I'm not aware of anything funny like with CN [13:08:08] yeah, it looks like nothing special had to be done for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1054574 either [13:08:14] I’ll just hope for the best then ^^ [13:08:19] (03PS1) 10Jelto: profile::firewall::nftables_throttling: add option for burst packets [puppet] - 10https://gerrit.wikimedia.org/r/1063004 (https://phabricator.wikimedia.org/T366882) [13:15:23] (03Merged) 10jenkins-bot: Save the request before starting the automatic vanish job [extensions/CentralAuth] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1062996 (https://phabricator.wikimedia.org/T372006) (owner: 10Seddon) [13:15:58] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1062996|Save the request before starting the automatic vanish job (T372006)]] [13:16:19] T372006: Unblock stuck global rename of multiple users - https://phabricator.wikimedia.org/T372006 [13:17:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:19:11] (03PS2) 10Jelto: profile::firewall::nftables_throttling: add option for burst packets [puppet] - 10https://gerrit.wikimedia.org/r/1063004 (https://phabricator.wikimedia.org/T366882) [13:19:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:22:23] !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [13:23:55] !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. 
[13:24:08] that k8s image build sure is taking a while [13:24:09] * Lucas_WMDE looks [13:24:51] okay, it’s making some progress again [13:25:09] looks like it had been pushing an image to the registry from 13:18 until about 13:24, which feels longer than it should be [13:25:58] !log klausman@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [13:25:59] anyway, that finished, now docker_pull_k8s is running [13:26:12] !log klausman@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [13:26:15] (also quite slowly… maybe the image diff is bigger than usual, though idk why it would be) [13:26:56] Long backports seem to be the norm with mw-on-k8s [13:27:38] Lucas_WMDE: that definitely feels longer than it should take [13:27:41] (03PS3) 10Jelto: profile::firewall::nftables_throttling: add option for burst packets [puppet] - 10https://gerrit.wikimedia.org/r/1063004 (https://phabricator.wikimedia.org/T366882) [13:27:55] Seddon: it got a lot better in the meantime tbh [13:28:07] claime: docker_pull_k8s is currently at 15% after 4 minutes fwiw [13:28:50] the image that took long to push was docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2024-08-15-131615-publish btw [13:29:11] (whereas pushing docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2024-08-14-180507-webserver only took two seconds) [13:29:37] (oh wait, that one’s a timestamp from yesterday so it probably had nothing to do anyway) [13:30:14] !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
[13:30:29] (i don't know what c.laime specifically is doing, but today is a holiday in france, just for the record) [13:30:42] (03PS4) 10Jelto: profile::firewall::nftables_throttling: add option for burst packets [puppet] - 10https://gerrit.wikimedia.org/r/1063004 (https://phabricator.wikimedia.org/T366882) [13:31:08] !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [13:31:43] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk failed on ms-be1079 - https://phabricator.wikimedia.org/T372560 (10MatthewVernon) 03NEW [13:31:44] ihurbain: thanks, I totally forgot about that – it’s not a holiday in godless berlin 😔 [13:31:50] !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . [13:31:53] (but it is in bavaria and a few other states. federalism!) [13:32:08] oh wait, I meant to ping cdanis anyway 🤦 [13:32:25] Lucas_WMDE: it's not in kanton zürich either, but it is in kanton zug next to it (yay federalism too :P ) [13:32:32] !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [13:32:34] haha :D [13:32:36] :D [13:32:49] !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:33:30] !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [13:33:45] !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [13:34:01] !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [13:34:13] !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[13:34:29] docker_pull_k8s finished, yay [13:34:37] !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:34:55] !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [13:35:04] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk failed on ms-be1079 - https://phabricator.wikimedia.org/T372560#10066857 (10MatthewVernon) p:05Triage→03High [13:38:07] !log klausman@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [13:40:31] !log lucaswerkmeister-wmde@deploy1003 seddon, lucaswerkmeister-wmde: Backport for [[gerrit:1062996|Save the request before starting the automatic vanish job (T372006)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:40:40] at last! [13:40:42] Seddon: please test ^^ [13:40:46] !log klausman@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [13:40:47] (if possible) [13:40:48] T372006: Unblock stuck global rename of multiple users - https://phabricator.wikimedia.org/T372006 [13:41:39] @Lucas_WMDE mwdebug? [13:41:42] Or production? [13:41:47] mwdebug so far [13:41:51] !log klausman@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [13:42:08] Lucas_WMDE: eqiad 1002? [13:42:21] mwdebug-k8s normally [13:42:34] but it should be synced to all of them [13:43:23] !log klausman@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [13:43:29] We are all good [13:44:39] !log klausman@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [13:44:43] !log lucaswerkmeister-wmde@deploy1003 seddon, lucaswerkmeister-wmde: Continuing with sync [13:44:46] ok, thanks! 
[13:45:33] !log klausman@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [13:46:12] !log klausman@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [13:47:59] !log klausman@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:49:51] (03PS7) 10Ssingh: P:dns::auth::update: maintain admin_state via confd [puppet] - 10https://gerrit.wikimedia.org/r/1053929 (https://phabricator.wikimedia.org/T369366) [13:50:26] !log klausman@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [13:50:43] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1062996|Save the request before starting the automatic vanish job (T372006)]] (duration: 34m 44s) [13:51:05] !log sudo cumin "A:dnsbox" 'disable-puppet "merging CR 1053929 T369366"' [13:51:31] hmm [13:51:46] (03PS5) 10Jelto: profile::firewall::nftables_throttling: add option for burst packets [puppet] - 10https://gerrit.wikimedia.org/r/1063004 (https://phabricator.wikimedia.org/T366882) [13:51:57] T372006: Unblock stuck global rename of multiple users - https://phabricator.wikimedia.org/T372006 [13:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:11] looks like stashbot was lagging a bit [13:52:26] T369366: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366 [13:52:30] !log UTC afternoon backport+config window done [13:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:34] Seddon: should be deployed everywhere now [13:54:02] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3657/co" [puppet] - 10https://gerrit.wikimedia.org/r/1063004 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [13:54:12] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission payments2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371631#10066924 (10Jhancock.wm) 05Open→03Resolved [13:54:16] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns4004.wikimedia.org,service=recdns [reason: admin_state migration test] [13:54:22] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns4004.wikimedia.org [reason: admin_state migration test] [13:54:33] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 13Patch-For-Review: decommission payments2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371630#10066925 (10Jhancock.wm) 05Open→03Resolved [13:54:41] (03CR) 10Ssingh: [C:03+2] P:dns::auth::update: maintain admin_state via confd [puppet] - 10https://gerrit.wikimedia.org/r/1053929 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [13:54:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:59:08] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site eqiad [reason: testing on dns4004, no task ID specified] [13:59:18] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqiad [reason: testing on dns4004, no task ID specified] [13:59:45] (03PS1) 10JMeybohm: reimage: Don't fail when mkfs takes a long time [cookbooks] - 10https://gerrit.wikimedia.org/r/1063006 
[13:59:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:59:57] (03PS6) 10Jelto: profile::firewall::nftables_throttling: add option for burst packets [puppet] - 10https://gerrit.wikimedia.org/r/1063004 (https://phabricator.wikimedia.org/T366882) [14:00:20] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site magru for service: text-addrs|text-next [reason: testing on dns4004, no task ID specified] [14:00:25] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site magru for service: text-addrs|text-next [reason: testing on dns4004, no task ID specified] [14:02:16] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1063004 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [14:04:45] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site magru [reason: testing on dns4004, no task ID specified] [14:04:47] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site magru [reason: testing on dns4004, no task ID specified] [14:04:50] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site eqiad [reason: testing on dns4004, no task ID specified] [14:04:51] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqiad [reason: testing on dns4004, no task ID specified] [14:06:22] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2007.codfw.wmnet with OS bullseye [14:06:28] 10ops-codfw, 06SRE, 
06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10066962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2007.codfw.wmnet with OS bullseye [14:11:47] (03PS1) 10Ebernhardson: Revert^2 "Search update pipeline: consume consolidated page-weighted-tags-change-stream" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063007 [14:14:35] (03PS1) 10Ssingh: P:dns::auth::update: fix location of admin_state file [puppet] - 10https://gerrit.wikimedia.org/r/1063009 (https://phabricator.wikimedia.org/T369366) [14:14:59] (03CR) 10CI reject: [V:04-1] P:dns::auth::update: fix location of admin_state file [puppet] - 10https://gerrit.wikimedia.org/r/1063009 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [14:15:44] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1063009 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [14:15:57] (03PS2) 10Ssingh: P:dns::auth::update: fix location of admin_state file [puppet] - 10https://gerrit.wikimedia.org/r/1063009 (https://phabricator.wikimedia.org/T369366) [14:17:01] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1063009 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [14:17:55] !log jayme@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-main2007.codfw.wmnet with OS bullseye [14:18:00] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10066995 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for 
host kafka-main2007.codfw.wmnet with OS bullseye executed with error... [14:18:43] (03CR) 10Ssingh: [V:03+1 C:03+2] P:dns::auth::update: fix location of admin_state file [puppet] - 10https://gerrit.wikimedia.org/r/1063009 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [14:18:44] (03CR) 10Ebernhardson: [C:03+2] Revert^2 "Search update pipeline: consume consolidated page-weighted-tags-change-stream" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063007 (owner: 10Ebernhardson) [14:19:42] (03Merged) 10jenkins-bot: Revert^2 "Search update pipeline: consume consolidated page-weighted-tags-change-stream" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063007 (owner: 10Ebernhardson) [14:21:05] !log ebernhardson@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:21:11] !log ebernhardson@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:24:29] (03PS2) 10JMeybohm: reimage: Don't fail when mkfs takes a long time [cookbooks] - 10https://gerrit.wikimedia.org/r/1063006 [14:25:09] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2007.codfw.wmnet with OS bullseye [14:25:17] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10067026 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2007.codfw.wmnet with OS bullseye [14:33:42] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site eqiad [reason: testing on dns4004, no task ID specified] [14:33:46] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqiad [reason: testing on dns4004, no task ID specified] [14:35:55] !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
[14:36:18] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site eqiad [reason: testing on dns4004, no task ID specified] [14:36:21] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqiad [reason: testing on dns4004, no task ID specified] [14:39:25] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:41:08] !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [14:42:30] (03CR) 10Eevans: aqs1022: provision new host for hardware refresh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1062772 (https://phabricator.wikimedia.org/T372514) (owner: 10Eevans) [14:43:19] !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [14:44:43] (03CR) 10Eevans: aqs1022: provision new host for hardware refresh (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1062772 (https://phabricator.wikimedia.org/T372514) (owner: 10Eevans) [14:44:55] (03PS2) 10Eevans: aqs1022: provision new host for hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/1062772 (https://phabricator.wikimedia.org/T372514) [14:46:27] !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [14:47:36] !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [14:48:21] !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [14:49:28] !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[14:53:39] !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [14:57:58] !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [14:59:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:00:02] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdb200[45] - https://phabricator.wikimedia.org/T369920#10067165 (10Dwisehaupt) @Papaul Oh, thanks for finding that. I didn't spot the wrong field when updating the typo. DNS looks good now. Thanks! [15:00:04] jeena and jnuche: Your horoscope predicts another Train log triage deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T1500). 
[15:00:23] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site magru [reason: testing on dns4004, no task ID specified] [15:00:24] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site magru [reason: testing on dns4004, no task ID specified] [15:00:38] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site esams [reason: testing on dns4004, no task ID specified] [15:00:39] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site esams [reason: testing on dns4004, no task ID specified] [15:00:43] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:58] (03CR) 10MVernon: [C:03+1] aqs1022: provision new host for hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/1062772 (https://phabricator.wikimedia.org/T372514) (owner: 10Eevans) [15:03:49] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: show site None [reason: no reason specified, no task ID specified] [15:03:49] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: show site None [reason: no reason specified, no task ID specified] [15:04:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:09:21] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: show site None [reason: no reason specified, no task ID specified] [15:09:22] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin 
(exit_code=0) DNS admin: show site None [reason: no reason specified, no task ID specified] [15:09:24] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site esams [reason: no reason specified, no task ID specified] [15:09:26] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site esams [reason: no reason specified, no task ID specified] [15:14:05] (03PS1) 10Ssingh: P:dns::auth::update set confd_admin_state true for all DNS boxes [puppet] - 10https://gerrit.wikimedia.org/r/1063012 (https://phabricator.wikimedia.org/T369366) [15:15:44] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1063012 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [15:17:14] (03CR) 10Ssingh: [V:03+1 C:03+2] P:dns::auth::update set confd_admin_state true for all DNS boxes [puppet] - 10https://gerrit.wikimedia.org/r/1063012 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [15:18:29] (03CR) 10Andrew Bogott: [C:03+1] ceph: add alert when we get no data from the cluster [alerts] - 10https://gerrit.wikimedia.org/r/1062962 (https://phabricator.wikimedia.org/T372528) (owner: 10David Caro) [15:19:19] (03CR) 10David Caro: [C:03+2] ceph: add alert when we get no data from the cluster [alerts] - 10https://gerrit.wikimedia.org/r/1062962 (https://phabricator.wikimedia.org/T372528) (owner: 10David Caro) [15:20:33] (03Merged) 10jenkins-bot: ceph: add alert when we get no data from the cluster [alerts] - 10https://gerrit.wikimedia.org/r/1062962 (https://phabricator.wikimedia.org/T372528) (owner: 10David Caro) [15:20:55] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns4004.wikimedia.org [reason: moving ahead with admin_state migration] [15:21:01] !log running authdns-update [15:21:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:32] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: show site None [reason: no reason specified, no task ID specified] [15:21:32] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: show site None [reason: no reason specified, no task ID specified] [15:27:32] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main2008.codfw.wmnet with OS bullseye [15:29:04] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10067228 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jayme@cumin1002 for host kafka-main2008.codfw.wmnet with OS bullseye [15:30:16] !log ebernhardson@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:30:25] !log ebernhardson@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:31:06] !log ebernhardson@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:31:17] !log ebernhardson@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:31:39] (03PS3) 10Hnowlan: shellbox: allow readinessCheck parameters to be passed in values files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062055 (https://phabricator.wikimedia.org/T357309) [15:32:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:34:34] (03CR) 10Kamila Součková: [C:03+1] "Generally LGTM, seems like the least bad option." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1053911 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [15:37:26] (03CR) 10Kamila Součková: [C:03+1] Add policy to allow GeoIP hostPath volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054905 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [15:41:34] (03PS2) 10Ssingh: Remove admin_state handling from ops/dns [dns] - 10https://gerrit.wikimedia.org/r/1059140 (owner: 10BBlack) [15:42:44] (03CR) 10Eevans: [C:03+2] aqs1022: provision new host for hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/1062772 (https://phabricator.wikimedia.org/T372514) (owner: 10Eevans) [15:43:15] (03CR) 10Ssingh: [C:03+2] Remove admin_state handling from ops/dns [dns] - 10https://gerrit.wikimedia.org/r/1059140 (owner: 10BBlack) [15:43:25] (03PS1) 10EoghanGaffney: apt-staging: Add check for packages in protected branches [puppet] - 10https://gerrit.wikimedia.org/r/1063015 [15:43:49] (03CR) 10CI reject: [V:04-1] apt-staging: Add check for packages in protected branches [puppet] - 10https://gerrit.wikimedia.org/r/1063015 (owner: 10EoghanGaffney) [15:43:52] !log running authdns-update [15:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:48] (03PS2) 10EoghanGaffney: apt-staging: Add check for packages in protected branches [puppet] - 10https://gerrit.wikimedia.org/r/1063015 [15:44:55] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install aqs1022.eqiad.wmnet - https://phabricator.wikimedia.org/T372514#10067280 (10Eevans) a:05Eevans→03None [15:45:17] !log running authdns-update again [15:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:29] (03CR) 10Dzahn: "thanks for merging and testing this" [puppet] - 10https://gerrit.wikimedia.org/r/1062471 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [15:48:55] !log sukhe@cumin1002 START - Cookbook sre.dns.netbox 
[15:49:15] (03CR) 10Kamila Součková: [C:03+1] "Maybe s/but-ptrace/except-ptrace/g ? Not a strong opinion though." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054891 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [15:51:15] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:52:48] !log sudo cumin -b1 -s60 "A:dnsbox" "run-puppet-agent --enable 'merging CR 1053929 T369366'": T369366 [15:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:19] T369366: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366 [15:53:53] !log reran druid_load_geoeditors_monthly, cassandra_load_editors_by_country_monthly, and druid_load_edit_hourly airflow dags with run_id scheduled__2024-06-01T00:00:00+00:00 as part of down stream tasks after rerunning mediawiki_history_denormalize for 2024-06 snapshot. [15:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:09] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2007.codfw.wmnet with OS bullseye [15:55:14] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10067297 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2007.codfw.wmnet with OS bullseye executed with error... [15:59:37] (03PS3) 10JMeybohm: reimage: Don't fail when mkfs takes a long time [cookbooks] - 10https://gerrit.wikimedia.org/r/1063006 [16:00:04] jhathaway and rzl: May I have your attention please! Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T1600) [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:02:02] (03PS1) 10David Caro: ceph: use the right metric for unknown alert [alerts] - 10https://gerrit.wikimedia.org/r/1063017 (https://phabricator.wikimedia.org/T372528) [16:03:05] (03CR) 10Andrew Bogott: [C:03+1] ceph: use the right metric for unknown alert [alerts] - 10https://gerrit.wikimedia.org/r/1063017 (https://phabricator.wikimedia.org/T372528) (owner: 10David Caro) [16:03:32] (03CR) 10David Caro: [C:03+2] ceph: use the right metric for unknown alert [alerts] - 10https://gerrit.wikimedia.org/r/1063017 (https://phabricator.wikimedia.org/T372528) (owner: 10David Caro) [16:04:43] (03Merged) 10jenkins-bot: ceph: use the right metric for unknown alert [alerts] - 10https://gerrit.wikimedia.org/r/1063017 (https://phabricator.wikimedia.org/T372528) (owner: 10David Caro) [16:05:26] (03CR) 10Scott French: [C:03+1] "Thanks, Hugh!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062055 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [16:06:27] 06SRE, 10SRE-Access-Requests: Requesting access to for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T371796#10067319 (10Dzahn) Ah! That's good, I think in this case we just need a confirmation from Katie and ask her to add you to the so called "NDA and MOU"-spreadsheet w... [16:13:30] (03CR) 10Hnowlan: [C:03+2] "For the short term given our p99 for" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062055 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [16:15:21] (03Merged) 10jenkins-bot: shellbox: allow readinessCheck parameters to be passed in values files [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062055 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [16:15:53] (03CR) 10Hnowlan: "Oops, hit submit too early. Given our p75 of ~20s this is definitely a concern for the future. 
Our p95 is generally about 45s but can spik" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1062055 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [16:32:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:36:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [16:36:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10067421 (10Andrew) @cmooney can we get cloudcephosd1036 set up now that the switch work is done? 
[16:44:20] (03PS1) 10Ebernhardson: cirrus: Stop general writes to private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063026 (https://phabricator.wikimedia.org/T341332) [16:44:59] (03CR) 10CI reject: [V:04-1] cirrus: Stop general writes to private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063026 (https://phabricator.wikimedia.org/T341332) (owner: 10Ebernhardson) [16:45:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063026 (https://phabricator.wikimedia.org/T341332) (owner: 10Ebernhardson) [16:51:18] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [16:51:50] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main2008.codfw.wmnet with reason: host reimage [16:51:58] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [16:52:58] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-video: apply [16:53:54] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-video: apply [16:54:55] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main2008.codfw.wmnet with reason: host reimage [17:03:10] (03CR) 10Scott French: "Hmmm .. it looks like this was never applied?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1061856 (https://phabricator.wikimedia.org/T371885) (owner: 10Filippo Giunchedi) [17:07:24] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site ulsfo [reason: testing live change, T369366] [17:07:35] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site ulsfo [reason: testing live change, T369366] [17:07:40] T369366: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366 [17:13:20] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2008.codfw.wmnet with OS bullseye [17:13:30] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10067507 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jayme@cumin1002 for host kafka-main2008.codfw.wmnet with OS bullseye completed: - kafka-... 
[17:18:14] 06SRE, 06Traffic: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366#10067510 (10ssingh) 05Open→03Resolved a:03ssingh https://wikitech.wikimedia.org/wiki/DNS#Change_GeoDNS_/_Depool_a_Site [17:19:10] jouncebot: now [17:19:10] For the next 0 hour(s) and 40 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T1700) [17:19:10] For the next 0 hour(s) and 40 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T1700) [17:19:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:19:37] * bd808 has nothing to deploy in his reserved window today [17:22:17] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: pool site ulsfo [reason: testing done, T369366] [17:22:28] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site ulsfo [reason: testing done, T369366] [17:22:33] T369366: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366 [17:33:23] (03PS1) 10Ssingh: sre.dns.admin: do not SAL on --show [cookbooks] - 10https://gerrit.wikimedia.org/r/1063028 [17:35:02] 06SRE, 10SRE-Access-Requests: Requesting access to for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T371796#10067540 (10KFrancis) Hi all, I'm confirming we have an NDA on file. Thanks! 
[17:39:33] (03PS2) 10Ssingh: sre.dns.admin: do not SAL on --show [cookbooks] - 10https://gerrit.wikimedia.org/r/1063028 [17:41:08] !log dwisehaupt@cumin1002 START - Cookbook sre.dns.netbox [17:44:35] !log dwisehaupt@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update mgmt dns for civi2002 frpig2002 - dwisehaupt@cumin1002" [17:44:39] !log dwisehaupt@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update mgmt dns for civi2002 frpig2002 - dwisehaupt@cumin1002" [17:44:39] !log dwisehaupt@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:44:58] (03PS1) 10Scott French: mediawiki: consistently apply stats-global values via symlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063031 (https://phabricator.wikimedia.org/T365265) [17:44:59] (03PS1) 10Scott French: mw-debug: pilot bookworm statsd exporter image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063032 (https://phabricator.wikimedia.org/T368366) [17:45:01] (03PS1) 10Scott French: mw-api-int: pilot bookworm statsd exporter image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063033 (https://phabricator.wikimedia.org/T368366) [17:45:03] (03PS1) 10Scott French: mediawiki: upgrade all statsd exporters to bookworm image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063034 (https://phabricator.wikimedia.org/T368366) [17:52:50] (03CR) 10CI reject: [V:04-1] sre.dns.admin: do not SAL on --show [cookbooks] - 10https://gerrit.wikimedia.org/r/1063028 (owner: 10Ssingh) [17:54:16] (03PS3) 10Ssingh: sre.dns.admin: do not SAL on --show [cookbooks] - 10https://gerrit.wikimedia.org/r/1063028 [17:59:39] (03PS1) 10Peter Fischer: Search update pipeline: bump version, write weighted tags to ES [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063035 (https://phabricator.wikimedia.org/T372362) [18:00:05] 
jeena and jnuche: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T1800). [18:00:25] !log aokoth@cumin1002 START - Cookbook sre.vrts.upgrade on VRTS host vrts1001.eqiad.wmnet [18:02:10] !log aokoth@cumin1002 END (PASS) - Cookbook sre.vrts.upgrade (exit_code=0) on VRTS host vrts1001.eqiad.wmnet [18:03:34] (03PS1) 10TrainBranchBot: group2 to 1.43.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063036 (https://phabricator.wikimedia.org/T366963) [18:03:36] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.43.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063036 (https://phabricator.wikimedia.org/T366963) (owner: 10TrainBranchBot) [18:04:15] (03Merged) 10jenkins-bot: group2 to 1.43.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063036 (https://phabricator.wikimedia.org/T366963) (owner: 10TrainBranchBot) [18:15:00] !log jhuneidi@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.43.0-wmf.18 refs T366963 [18:15:20] T366963: 1.43.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T366963 [18:26:33] (03CR) 10Ssingh: [C:03+2] sre.dns.admin: do not SAL on --show [cookbooks] - 10https://gerrit.wikimedia.org/r/1063028 (owner: 10Ssingh) [18:54:47] !log running global rename cleanup script per T372006#10055573 [18:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:07] T372006: Unblock stuck global rename of multiple users - https://phabricator.wikimedia.org/T372006 [19:26:28] (03CR) 10BCornwall: [C:03+1] wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1062783 (https://phabricator.wikimedia.org/T372524) (owner: 10Gerrit maintenance bot) [19:28:53] (03CR) 10BCornwall: [C:03+1] Remove entries for payments2001 and payments2002 [dns] - 10https://gerrit.wikimedia.org/r/1062155 (https://phabricator.wikimedia.org/T371630) (owner: 10Dwisehaupt) 
[19:38:05] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10067817 (10Jhancock.wm) [19:42:16] jouncebot next [19:42:16] In 0 hour(s) and 17 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T2000) [19:49:48] 06SRE, 10SRE-Access-Requests: Requesting access to for  - https://phabricator.wikimedia.org/T372445#10067847 (10ecarg) @JMeybohm I receive the following after replacing with that url: ` curl -XGET 'https://logs-api.svc.eqiad.wmnet/_msearch?pretty&size=10000... [19:54:40] (03PS1) 10Bartosz Dziewoński: Revert "CommentFormatter: Switch from deprecated addJsConfigVars to new setJsConfigVar" [extensions/DiscussionTools] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1063041 (https://phabricator.wikimedia.org/T372499) [19:54:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 15 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/DiscussionTools] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1063041 (https://phabricator.wikimedia.org/T372499) (owner: 10Bartosz Dziewoński) [19:56:15] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10067859 (10Jhancock.wm) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240815T2000). Please do the needful. [20:00:05] ebernhardson and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:24] hi [20:00:59] i just made my patch a minute ago, i haven't found anyone to review it yet, but it's hopefully a straightforward revert [20:01:34] \o [20:01:41] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd200[1-4] - https://phabricator.wikimedia.org/T370545#10067873 (10Jhancock.wm) [20:06:40] hmm, do we have any deployers around? [20:07:59] MatmaRex: aren't you a deployer? [20:08:07] if not i guess i can [20:08:11] nope [20:08:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [extensions/DiscussionTools] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1063041 (https://phabricator.wikimedia.org/T372499) (owner: 10Bartosz Dziewoński) [20:09:54] (03CR) 10Ebernhardson: [C:03+1] Search update pipeline: bump version, write weighted tags to ES [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063035 (https://phabricator.wikimedia.org/T372362) (owner: 10Peter Fischer) [20:11:11] (03CR) 10Scott French: [C:03+1] "Thanks, Hugh!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059394 (https://phabricator.wikimedia.org/T369048) (owner: 10Hnowlan) [20:19:39] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd200[1-4] - https://phabricator.wikimedia.org/T370545#10067961 (10Jhancock.wm) [20:20:03] (03Merged) 10jenkins-bot: Revert "CommentFormatter: Switch from deprecated addJsConfigVars to new setJsConfigVar" [extensions/DiscussionTools] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1063041 (https://phabricator.wikimedia.org/T372499) (owner: 10Bartosz Dziewoński) [20:20:16] !log ebernhardson@deploy1003 Started scap sync-world: Backport for [[gerrit:1063041|Revert "CommentFormatter: Switch from deprecated addJsConfigVars to new setJsConfigVar" (T372499)]] [20:20:40] T372499: InvalidArgumentException: Multiple conflicting values given for wgDiscussionToolsPageThreads (August 2024) - https://phabricator.wikimedia.org/T372499 [20:23:31] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10067966 (10Jhancock.wm) @MoritzMuehlenhoff when you have a moment, can you do this step for me please? thanks! Update the operations/puppet repo [20:23:33] !log ebernhardson@deploy1003 ebernhardson, matmarex: Backport for [[gerrit:1063041|Revert "CommentFormatter: Switch from deprecated addJsConfigVars to new setJsConfigVar" (T372499)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:23:46] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10067968 (10Jhancock.wm) @aborrero when you have a moment, can you do this step for me please? thanks! 
Update the operations/puppet repo [20:24:25] testing [20:24:50] and it looks good [20:25:13] (tested on https://en.wikivoyage.org/wiki/Template_talk:Related#Related_topics_&_empty_space_in_articles_bottom - with the fix applied, the page has "Reply" buttons again) [20:25:40] MatmaRex: excellent, going forward [20:25:46] !log ebernhardson@deploy1003 ebernhardson, matmarex: Continuing with sync [20:30:23] !log ebernhardson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1063041|Revert "CommentFormatter: Switch from deprecated addJsConfigVars to new setJsConfigVar" (T372499)]] (duration: 10m 06s) [20:30:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063026 (https://phabricator.wikimedia.org/T341332) (owner: 10Ebernhardson) [20:31:06] T372499: InvalidArgumentException: Multiple conflicting values given for wgDiscussionToolsPageThreads (August 2024) - https://phabricator.wikimedia.org/T372499 [20:31:53] (03CR) 10Scott French: [C:03+1] rpc: add script for running jobs from stdin rather than http (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059394 (https://phabricator.wikimedia.org/T369048) (owner: 10Hnowlan) [20:32:12] (03CR) 10CI reject: [V:04-1] cirrus: Stop general writes to private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063026 (https://phabricator.wikimedia.org/T341332) (owner: 10Ebernhardson) [20:32:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:33:51] thank you very much for deploying ebernhardson [20:33:59] certainly [20:35:14] (03PS2) 10Ebernhardson: cirrus: Stop general writes to private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063026 
(https://phabricator.wikimedia.org/T341332) [20:36:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063026 (https://phabricator.wikimedia.org/T341332) (owner: 10Ebernhardson) [20:36:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [20:37:09] (03Merged) 10jenkins-bot: cirrus: Stop general writes to private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063026 (https://phabricator.wikimedia.org/T341332) (owner: 10Ebernhardson) [20:37:19] !log ebernhardson@deploy1003 Started scap sync-world: Backport for [[gerrit:1063026|cirrus: Stop general writes to private wikis (T341332)]] [20:37:51] T341332: [EPIC] The CirrusSearch streaming updater should support private wikis - https://phabricator.wikimedia.org/T341332 [20:39:17] !log ebernhardson@deploy1003 ebernhardson: Backport for [[gerrit:1063026|cirrus: Stop general writes to private wikis (T341332)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:41:15] !log ebernhardson@deploy1003 ebernhardson: Continuing with sync [20:45:45] !log ebernhardson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1063026|cirrus: Stop general writes to private wikis (T341332)]] (duration: 08m 25s) [20:46:10] T341332: [EPIC] The CirrusSearch streaming updater should support private wikis - https://phabricator.wikimedia.org/T341332 [20:49:08] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [20:51:35] (03CR) 10Peter Fischer: [C:03+2] Search update pipeline: bump version, write weighted tags to ES [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063035 (https://phabricator.wikimedia.org/T372362) (owner: 10Peter Fischer) [20:52:37] (03Merged) 10jenkins-bot: Search update 
pipeline: bump version, write weighted tags to ES [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063035 (https://phabricator.wikimedia.org/T372362) (owner: 10Peter Fischer) [20:54:56] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2035 to codfw - jhancock@cumin2002" [20:55:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2035 to codfw - jhancock@cumin2002" [20:55:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:01:13] !log backport window complete [21:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:37:44] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [21:41:51] (03PS1) 10Ryan Kemper: wdqs graph-split: data xfer needs python3-snappy [puppet] - 10https://gerrit.wikimedia.org/r/1063067 (https://phabricator.wikimedia.org/T364077) [21:42:24] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1063067 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper) [21:43:16] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudlb2004-dev to codfw - jhancock@cumin2002" [21:43:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudlb2004-dev to codfw - jhancock@cumin2002" [21:43:21] !log jhancock@cumin2002 END (PASS) - Cookbook 
sre.dns.netbox (exit_code=0) [21:44:36] (03PS1) 10BCornwall: varnish: Remove carriers netmap [puppet] - 10https://gerrit.wikimedia.org/r/1063069 (https://phabricator.wikimedia.org/T370200) [21:45:57] (03PS8) 10Ryan Kemper: wdqs: store metadata about graph split type [cookbooks] - 10https://gerrit.wikimedia.org/r/1053205 (https://phabricator.wikimedia.org/T364077) [21:46:28] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudlb2004-dev.mgmt.codfw.wmnet with reboot policy FORCED [21:46:56] !log pfischer@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:47:04] !log pfischer@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:48:28] (03CR) 10Bking: [C:03+1] "The PCC failure is for Puppet 5 only; the change works with Puppet 7." [puppet] - 10https://gerrit.wikimedia.org/r/1063067 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper) [21:48:37] (03CR) 10Ryan Kemper: [C:03+2] wdqs graph-split: data xfer needs python3-snappy [puppet] - 10https://gerrit.wikimedia.org/r/1063067 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper) [21:50:44] (03CR) 10Cwhite: [C:03+1] thanos: temp disable compact [puppet] - 10https://gerrit.wikimedia.org/r/1062678 (https://phabricator.wikimedia.org/T351927) (owner: 10Filippo Giunchedi) [21:51:16] (03PS1) 10Andrea Denisse: alert: Ensure the alert[12]002 hosts use the alerting_host role [puppet] - 10https://gerrit.wikimedia.org/r/1062444 (https://phabricator.wikimedia.org/T372418) [21:51:16] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/1062444/3666/" [puppet] - 10https://gerrit.wikimedia.org/r/1062444 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [21:53:15] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T370754, transfer fresh wdqs-main journal to codfw host) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling 
neither afterwards [21:53:15] (03CR) 10Cwhite: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1062393 (https://phabricator.wikimedia.org/T326657) (owner: 10Filippo Giunchedi) [21:53:31] T370754: Import WDQS subgraphs to production nodes - https://phabricator.wikimedia.org/T370754 [21:53:55] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T370754, transfer fresh wdqs-main journal to codfw host) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling neither afterwards [21:54:03] (03CR) 10Cwhite: [C:03+1] alert: Ensure the alert[12]002 hosts use the alerting_host role [puppet] - 10https://gerrit.wikimedia.org/r/1062444 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [21:54:27] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T370754, transfer fresh wdqs-main journal to codfw host) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet w/ force delete existing files, repooling neither afterwards [21:55:06] (03PS1) 10Andrea Denisse: alert: Add the alert[12]002 hosts as alertmanagers [puppet] - 10https://gerrit.wikimedia.org/r/1063063 (https://phabricator.wikimedia.org/T372418) [21:55:06] (03CR) 10Andrea Denisse: [V:03+1] "PCC results: https://puppet-compiler.wmflabs.org/output/1063063/3667/" [puppet] - 10https://gerrit.wikimedia.org/r/1063063 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [21:55:32] (03CR) 10BCornwall: "FWIW, vtc tests passed:" [puppet] - 10https://gerrit.wikimedia.org/r/1063069 (https://phabricator.wikimedia.org/T370200) (owner: 10BCornwall) [21:59:18] (03CR) 10CI reject: [V:04-1] wdqs: store metadata about graph split type [cookbooks] - 10https://gerrit.wikimedia.org/r/1053205 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper) [22:01:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudlb2004-dev.mgmt.codfw.wmnet with reboot policy FORCED 
[22:02:05] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudlb2004-dev']
[22:09:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudlb2004-dev']
[22:10:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudlb2004-dev.codfw.wmnet with OS bookworm
[22:10:18] ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10068225 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudlb2004-dev.codfw.wmnet with OS bookworm
[22:13:34] (PS1) Andrea Denisse: alert: Ensure alert1002 is the active alert host [puppet] - https://gerrit.wikimedia.org/r/1063075 (https://phabricator.wikimedia.org/T372418)
[22:13:34] (CR) Andrea Denisse: [V:+1] "PCC results: https://puppet-compiler.wmflabs.org/output/1063075/3668/" [puppet] - https://gerrit.wikimedia.org/r/1063075 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[22:21:04] (PS1) Andrea Denisse: alert: Resolve alerts DNS queries to alert1002 [dns] - https://gerrit.wikimedia.org/r/1063078 (https://phabricator.wikimedia.org/T372418)
[22:42:57] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T370754, transfer fresh wdqs-main journal to codfw host) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet w/ force delete existing files, repooling neither afterwards
[22:43:15] T370754: Import WDQS subgraphs to production nodes - https://phabricator.wikimedia.org/T370754
[23:10:20] !log T372449 mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki --logwiki=metawiki 'Philip Federici' 'FilippoFederici' --ignorestatus
[23:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:39] T372449: Unblock
stuck global rename of FilippoFederici - https://phabricator.wikimedia.org/T372449
[23:30:23] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudlb2004-dev.codfw.wmnet with OS bookworm
[23:30:29] ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10068319 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudlb2004-dev.codfw.wmnet with OS bookworm executed...
[23:38:44] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1063080
[23:38:44] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1063080 (owner: TrainBranchBot)