[00:03:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:07:11] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1063356 (owner: TrainBranchBot)
[01:32:49] FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:53:48] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:39:26] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:43] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:03:25] RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:04:26] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 -
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:11:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[03:33:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:14:21] (PS1) Marostegui: installserver: Do not format db2225 [puppet] - https://gerrit.wikimedia.org/r/1063358
[05:16:48] (CR) Marostegui: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1063358 (owner: Marostegui)
[05:17:22] (CR) Marostegui: [C:+2] installserver: Do not format db2225 [puppet] - https://gerrit.wikimedia.org/r/1063358 (owner: Marostegui)
[05:23:48] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:23:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1195 T372536', diff saved to https://phabricator.wikimedia.org/P67363 and previous config saved to /var/cache/conftool/dbconfig/20240819-052352-root.json
[05:23:55] T372536: Compile and package MariaDB 10.6.19 - https://phabricator.wikimedia.org/T372536
[05:24:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db1195.eqiad.wmnet with reason: Upgrade to 10.6.19
[05:25:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1195.eqiad.wmnet with reason: Upgrade to 10.6.19
[05:30:01] !log marostegui@cumin1002 dbctl commit (dc=all):
'db1195 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P67364 and previous config saved to /var/cache/conftool/dbconfig/20240819-053000-root.json
[05:32:49] FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:45:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P67365 and previous config saved to /var/cache/conftool/dbconfig/20240819-054506-root.json
[05:53:25] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:00:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P67366 and previous config saved to /var/cache/conftool/dbconfig/20240819-060011-root.json
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops -
https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:15:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P67367 and previous config saved to /var/cache/conftool/dbconfig/20240819-061517-root.json
[06:16:37] (CR) Slyngshede: [C:+1] "Looks correct." [puppet] - https://gerrit.wikimedia.org/r/1061118 (owner: Majavah)
[06:18:47] (CR) Slyngshede: [C:+2] data.yaml: Offboarding of mcastro [puppet] - https://gerrit.wikimedia.org/r/1059743 (owner: Slyngshede)
[06:30:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P67368 and previous config saved to /var/cache/conftool/dbconfig/20240819-063023-root.json
[06:32:45] (CR) Ayounsi: [C:+2] hieradata: idp: Allow 'nda' to use netbox_oidc client [puppet] - https://gerrit.wikimedia.org/r/1061118 (owner: Majavah)
[06:35:36] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1057799 (https://phabricator.wikimedia.org/T371209) (owner: Stevemunene)
[06:41:12] (CR) Ayounsi: [C:+2] Netbox: set RQ_DEFAULT_TIMEOUT back to default of 300 [puppet] - https://gerrit.wikimedia.org/r/1063168 (https://phabricator.wikimedia.org/T341843) (owner: Ayounsi)
[06:45:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P67369 and previous config saved to /var/cache/conftool/dbconfig/20240819-064528-root.json
[06:46:50] (CR) Ayounsi: [C:+2] Netbox: remove prefer_ipv4 flag [puppet] - https://gerrit.wikimedia.org/r/1063178 (owner: Ayounsi)
[06:52:22] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1063179 (https://phabricator.wikimedia.org/T341843) (owner: Ayounsi)
[06:54:16] (CR) Ayounsi: [C:+2] Netbox: disable rq-netbox on secondary node [puppet] -
https://gerrit.wikimedia.org/r/1063179 (https://phabricator.wikimedia.org/T341843) (owner: Ayounsi)
[07:00:04] Amir1 and Urbanecm: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240819T0700)
[07:00:04] No Gerrit patches in the queue for this window AFAICS.
[07:00:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1195 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P67370 and previous config saved to /var/cache/conftool/dbconfig/20240819-070034-root.json
[07:04:26] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:11:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[07:12:21] (CR) Ayounsi: [C:+2] Network report, remove clusters from NO_V6_DEVICE_NAME_PREFIXES [software/netbox-extras] - https://gerrit.wikimedia.org/r/1062665 (owner: Ayounsi)
[07:14:17] (Merged) jenkins-bot: Network report, remove clusters from NO_V6_DEVICE_NAME_PREFIXES [software/netbox-extras] - https://gerrit.wikimedia.org/r/1062665 (owner: Ayounsi)
[07:14:44] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[07:14:57] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[07:16:29] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[07:16:57] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[07:20:55] (CR) Ayounsi: [C:+2] Remove custom_script_proxy.py and getstats.py [software/netbox-extras] - https://gerrit.wikimedia.org/r/1060079 (https://phabricator.wikimedia.org/T311052) (owner: Ayounsi)
[07:22:49] (Merged) jenkins-bot: Remove custom_script_proxy.py and getstats.py [software/netbox-extras] - https://gerrit.wikimedia.org/r/1060079 (https://phabricator.wikimedia.org/T311052) (owner: Ayounsi)
[07:24:58] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[07:25:12] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[07:25:18] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[07:25:46] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[07:28:08] (PS1) Kevin Bazira: ml-services: huggingface from src dir [deployment-charts] - https://gerrit.wikimedia.org/r/1063704 (https://phabricator.wikimedia.org/T369344)
[07:28:51] (CR) Filippo Giunchedi: [C:+1] alert: Ensure the alert[12]002 hosts use the alerting_host role [puppet] - https://gerrit.wikimedia.org/r/1062444 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[07:33:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:33:56] (CR) Filippo Giunchedi: [C:-1] "See inline, LGTM overall" [puppet] - https://gerrit.wikimedia.org/r/1063063 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[07:37:00] (CR) Filippo Giunchedi: alert: Ensure alert1002 is the active alert host (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1063075 (https://phabricator.wikimedia.org/T372418) (owner: Andrea
Denisse)
[07:37:48] (CR) Filippo Giunchedi: [C:+1] "LGTM, to be merged when the time comes of course" [dns] - https://gerrit.wikimedia.org/r/1063078 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[07:37:54] (CR) Filippo Giunchedi: [C:+2] icinga: Add payments2004 and payments2005 to service [puppet] - https://gerrit.wikimedia.org/r/1063247 (https://phabricator.wikimedia.org/T369942) (owner: Dwisehaupt)
[07:39:03] (CR) Filippo Giunchedi: [C:+1] alert: Update alertmanager tests hostnames [puppet] - https://gerrit.wikimedia.org/r/1063235 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[07:42:34] (CR) Filippo Giunchedi: Create corto deployment/configuration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: BCornwall)
[07:48:52] SRE, Data-Engineering, Data-Platform-SRE, observability, and 3 others: Upgrade Kafka to from 1.x to later version - https://phabricator.wikimedia.org/T300102#10072072 (brouberol) The upgrade of kafka is currently a stretch OKR and will probably be a core OKR for next quarter. I'll definitely need...
[07:50:43] RESOLVED: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:52:41] (PS1) Filippo Giunchedi: rancid: add git directory as safe [puppet] - https://gerrit.wikimedia.org/r/1063727
[07:56:18] (PS1) Ayounsi: os-updates-report: don't fail on hosts with no roles [puppet] - https://gerrit.wikimedia.org/r/1063728 (https://phabricator.wikimedia.org/T372728)
[07:59:26] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:01:40] (PS5) Brouberol: airflow: deploy postgresql cluster before airflow itself [deployment-charts] - https://gerrit.wikimedia.org/r/1062397 (https://phabricator.wikimedia.org/T372286)
[08:01:40] (PS6) Brouberol: airflow: fetch PG connection URI from the cloudnative PG cluster secret [deployment-charts] - https://gerrit.wikimedia.org/r/1062398 (https://phabricator.wikimedia.org/T372286)
[08:07:32] klausman: I wanted to rebase https://gerrit.wikimedia.org/r/c/operations/puppet/+/1062688 but there seems to be a merge conflict in site.pp
[08:07:47] lemme check
[08:10:58] oh I flubbed it. that change is actually obsolete. Sorry 'bout that!
[08:11:53] (Abandoned) Klausman: manifest/hiera/conftool: Add new ML GPU hosts in eqiad [puppet] - https://gerrit.wikimedia.org/r/1062688 (https://phabricator.wikimedia.org/T368978) (owner: Klausman)
[08:12:09] brouberol: sorry for wasting your time there
[08:14:39] !log brouberol@cumin1002 START - Cookbook sre.dns.netbox
[08:14:48] ops-eqiad, SRE, DC-Ops, Machine-Learning-Team, Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10072112 (klausman) a:klausman→None
[08:17:08] (CR) Slyngshede: [C:+2] Implement 2FA support [software/bitu] - https://gerrit.wikimedia.org/r/1057862 (owner: Slyngshede)
[08:18:37] !log brouberol@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding AAAA field to snapshot1010 and dumpsdata1003 - brouberol@cumin1002"
[08:18:42] !log brouberol@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Adding AAAA field to snapshot1010 and dumpsdata1003 - brouberol@cumin1002"
[08:18:42] !log brouberol@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:18:54] (PS1) Kevin Bazira: ml-services: set API_PREFIX in rec-api [deployment-charts] - https://gerrit.wikimedia.org/r/1063729 (https://phabricator.wikimedia.org/T371465)
[08:27:41] (PS1) Ayounsi: Network report: remove dumpsdata and snapshot from v6 test [software/netbox-extras] - https://gerrit.wikimedia.org/r/1063730 (https://phabricator.wikimedia.org/T372453)
[08:28:30] (CR) Ayounsi: [C:+1] rancid: add git directory as safe [puppet] - https://gerrit.wikimedia.org/r/1063727 (owner: Filippo Giunchedi)
[08:29:57] (CR) Ayounsi: [C:+2] "Self merging as it's a minor change" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1063730
(https://phabricator.wikimedia.org/T372453) (owner: Ayounsi)
[08:31:55] (Merged) jenkins-bot: Network report: remove dumpsdata and snapshot from v6 test [software/netbox-extras] - https://gerrit.wikimedia.org/r/1063730 (https://phabricator.wikimedia.org/T372453) (owner: Ayounsi)
[08:31:59] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[08:32:12] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[08:32:39] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[08:33:07] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[08:33:08] (CR) Jelto: [V:+1 C:+2] profile::firewall::nftables_throttling: add option for burst packets [puppet] - https://gerrit.wikimedia.org/r/1063004 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[08:34:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2136', diff saved to https://phabricator.wikimedia.org/P67371 and previous config saved to /var/cache/conftool/dbconfig/20240819-083439-root.json
[08:34:43] (PS1) Clément Goubert: Fix typo in new wikikube-worker regex [puppet] - https://gerrit.wikimedia.org/r/1063732 (https://phabricator.wikimedia.org/T368933)
[08:35:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2136.codfw.wmnet with reason: Upgrade to 10.11.9
[08:35:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2136.codfw.wmnet with reason: Upgrade to 10.11.9
[08:35:41] !log Upgrade db2136 to 10.11.9 T372551
[08:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:49] T372551: Compile and package MariaDB 10.11.9 - https://phabricator.wikimedia.org/T372551
[08:35:55] FIRING: SystemdUnitFailed: systemd-timedated.service
on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:36:39] (CR) Ayounsi: [C:+1] Fix typo in new wikikube-worker regex [puppet] - https://gerrit.wikimedia.org/r/1063732 (https://phabricator.wikimedia.org/T368933) (owner: Clément Goubert)
[08:37:14] (CR) Filippo Giunchedi: [C:+2] rancid: add git directory as safe [puppet] - https://gerrit.wikimedia.org/r/1063727 (owner: Filippo Giunchedi)
[08:37:18] (PS2) Samtar: [WIP] Add CommunityRequests [mediawiki-config] - https://gerrit.wikimedia.org/r/1062977 (https://phabricator.wikimedia.org/T372527)
[08:37:24] (PS3) Samtar: [WIP] Add CommunityRequests [mediawiki-config] - https://gerrit.wikimedia.org/r/1062977 (https://phabricator.wikimedia.org/T372527)
[08:37:34] (PS4) Samtar: Add CommunityRequests [mediawiki-config] - https://gerrit.wikimedia.org/r/1062977 (https://phabricator.wikimedia.org/T372527)
[08:37:56] (CR) Clément Goubert: [C:+2] Fix typo in new wikikube-worker regex [puppet] - https://gerrit.wikimedia.org/r/1063732 (https://phabricator.wikimedia.org/T368933) (owner: Clément Goubert)
[08:38:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2136 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P67372 and previous config saved to /var/cache/conftool/dbconfig/20240819-083814-root.json
[08:38:32] (CR) Filippo Giunchedi: [V:+1 C:+2] thanos: temp disable compact [puppet] - https://gerrit.wikimedia.org/r/1062678 (https://phabricator.wikimedia.org/T351927) (owner: Filippo Giunchedi)
[08:39:35] (CR) MVernon: [C:+2] cephadm: separate templates for zonegroup setup and rgw placement [puppet] - https://gerrit.wikimedia.org/r/1063196 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[08:40:35] (PS7) Ayounsi: check_netbox_report.py: reports -> scripts
[puppet] - https://gerrit.wikimedia.org/r/1059042
[08:40:45] (CR) Ayounsi: check_netbox_report.py: reports -> scripts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1059042 (owner: Ayounsi)
[08:41:37] (CR) AOkoth: vrts: build & install packages (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1062715 (https://phabricator.wikimedia.org/T366078) (owner: AOkoth)
[08:41:46] (CR) Ayounsi: [C:+2] check_netbox_report.py: reports -> scripts [puppet] - https://gerrit.wikimedia.org/r/1059042 (owner: Ayounsi)
[08:44:52] (PS1) AOkoth: vrts: run install script on new server [puppet] - https://gerrit.wikimedia.org/r/1063733
[08:48:51] (CR) Filippo Giunchedi: opensearch: unreach port and shards alerts (2 comments) [alerts] - https://gerrit.wikimedia.org/r/1062708 (https://phabricator.wikimedia.org/T371083) (owner: Tiziano Fogli)
[08:49:38] (CR) Btullis: [C:+1] "Looks good." [deployment-charts] - https://gerrit.wikimedia.org/r/1062431 (https://phabricator.wikimedia.org/T372286) (owner: Brouberol)
[08:49:45] (PS9) Arnaudb: mariadb: cookbook to safely upgrade and reboot santarium hosts [cookbooks] - https://gerrit.wikimedia.org/r/1063167 (https://phabricator.wikimedia.org/T363665)
[08:50:59] (CR) Btullis: [C:+1] airflow: deploy postgresql cluster before airflow itself [deployment-charts] - https://gerrit.wikimedia.org/r/1062397 (https://phabricator.wikimedia.org/T372286) (owner: Brouberol)
[08:51:23] (CR) Filippo Giunchedi: [C:+2] prometheus: add oauth2-proxy for OIDC authentication [puppet] - https://gerrit.wikimedia.org/r/1062393 (https://phabricator.wikimedia.org/T326657) (owner: Filippo Giunchedi)
[08:52:47] (PS2) AOkoth: vrts: run install script on new server [puppet] - https://gerrit.wikimedia.org/r/1063733
[08:53:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2136 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P67373 and previous
config saved to /var/cache/conftool/dbconfig/20240819-085320-root.json
[08:54:04] (CR) Btullis: [C:+1] airflow: fetch PG connection URI from the cloudnative PG cluster secret [deployment-charts] - https://gerrit.wikimedia.org/r/1062398 (https://phabricator.wikimedia.org/T372286) (owner: Brouberol)
[08:54:07] (PS10) Arnaudb: mariadb: cookbook to safely upgrade and reboot santarium hosts [cookbooks] - https://gerrit.wikimedia.org/r/1063167 (https://phabricator.wikimedia.org/T363665)
[08:56:01] (CR) Arnaudb: mariadb: cookbook to safely upgrade and reboot santarium hosts (10 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1063167 (https://phabricator.wikimedia.org/T363665) (owner: Arnaudb)
[08:57:31] (PS1) Slyngshede: C:idm::deployment Add Bitu 2FA dependencies. [puppet] - https://gerrit.wikimedia.org/r/1063736
[08:59:24] (PS1) Filippo Giunchedi: prometheus: fix auth_cas vhost configuration [puppet] - https://gerrit.wikimedia.org/r/1063737 (https://phabricator.wikimedia.org/T326657)
[08:59:38] (CR) Filippo Giunchedi: [V:+2 C:+2] prometheus: fix auth_cas vhost configuration [puppet] - https://gerrit.wikimedia.org/r/1063737 (https://phabricator.wikimedia.org/T326657) (owner: Filippo Giunchedi)
[09:00:55] RESOLVED: SystemdUnitFailed: systemd-timedated.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:02:57] (CR) Ayounsi: [C:+2] Validators: enforce Trident3 port block consistency [software/netbox-extras] - https://gerrit.wikimedia.org/r/985113 (https://phabricator.wikimedia.org/T303529) (owner: Ayounsi)
[09:04:56] (Merged) jenkins-bot: Validators: enforce Trident3 port block consistency [software/netbox-extras] - https://gerrit.wikimedia.org/r/985113 (https://phabricator.wikimedia.org/T303529) (owner: Ayounsi)
[09:05:57] (CR)
Ayounsi: [C:+2] Validate IRB interface names correspond to vlan and refactor [software/netbox-extras] - https://gerrit.wikimedia.org/r/1040154 (https://phabricator.wikimedia.org/T366348) (owner: Cathal Mooney)
[09:06:52] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[09:07:23] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[09:07:46] (Merged) jenkins-bot: Validate IRB interface names correspond to vlan and refactor [software/netbox-extras] - https://gerrit.wikimedia.org/r/1040154 (https://phabricator.wikimedia.org/T366348) (owner: Cathal Mooney)
[09:08:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2136 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P67374 and previous config saved to /var/cache/conftool/dbconfig/20240819-090825-root.json
[09:11:55] FIRING: SystemdUnitFailed: systemd-timedated.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:12:14] (CR) Brouberol: [C:+2] rbac: grant RBAC permissions on cloudnative-pg CRDs to our view/deploy ClusterRoles [deployment-charts] - https://gerrit.wikimedia.org/r/1062431 (https://phabricator.wikimedia.org/T372286) (owner: Brouberol)
[09:16:56] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[09:17:22] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[09:19:15] (CR) Klausman: [C:+1] ml-services: set API_PREFIX in rec-api [deployment-charts] - https://gerrit.wikimedia.org/r/1063729 (https://phabricator.wikimedia.org/T371465) (owner: Kevin Bazira)
[09:19:27] (CR) Klausman: [C:+1] ml-services: huggingface from src dir [deployment-charts] - https://gerrit.wikimedia.org/r/1063704 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[09:23:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2136 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P67375 and previous config saved to /var/cache/conftool/dbconfig/20240819-092331-root.json
[09:23:40] (PS1) Clément Goubert: cumin: Remove parsoid from aliases [puppet] - https://gerrit.wikimedia.org/r/1063744 (https://phabricator.wikimedia.org/T359387)
[09:25:05] (CR) Brouberol: [C:+2] airflow: deploy postgresql cluster before airflow itself [deployment-charts] - https://gerrit.wikimedia.org/r/1062397 (https://phabricator.wikimedia.org/T372286) (owner: Brouberol)
[09:28:13] (CR) Hnowlan: [C:+1] APIGW: Add configuration to expose LW isvc articlequality [deployment-charts] - https://gerrit.wikimedia.org/r/1063225 (https://phabricator.wikimedia.org/T360455) (owner: Ilias Sarantopoulos)
[09:28:19] ops-eqiad, SRE, Cloud-VPS, DC-Ops, cloud-services-team (FY2024/2025-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#10072457 (fnegri)
[09:28:27] (CR) Brouberol: [C:+2] airflow: fetch PG connection URI from the cloudnative PG cluster secret [deployment-charts] - https://gerrit.wikimedia.org/r/1062398 (https://phabricator.wikimedia.org/T372286) (owner: Brouberol)
[09:29:24] (CR) Kamila Součková: [C:+1] (de|uk|ja|he|fi)wiki: enable shellbox-video [mediawiki-config] - https://gerrit.wikimedia.org/r/1062979 (https://phabricator.wikimedia.org/T369048) (owner: Hnowlan)
[09:32:10] RESOLVED: SystemdUnitFailed:
systemd-timedated.service on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:32:49] FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:36:02] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1062979 (https://phabricator.wikimedia.org/T369048) (owner: Hnowlan)
[09:38:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2136 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P67376 and previous config saved to /var/cache/conftool/dbconfig/20240819-093836-root.json
[09:48:23] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Disk failed on ms-be1079 - https://phabricator.wikimedia.org/T372560#10072553 (VRiley-WMF) a:VRiley-WMF
[09:49:26] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Disk failed on ms-be1079 - https://phabricator.wikimedia.org/T372560#10072554 (VRiley-WMF) Thanks for the information. I should be able to replace it without a problem. Since this is under warranty, I'll be getting a drive for this as soon as possible.
[09:50:04] (PS9) Btullis: Add radosgw services to the cephosd servers [puppet] - https://gerrit.wikimedia.org/r/1034973 (https://phabricator.wikimedia.org/T330152)
[09:50:08] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372336#10072560 (VRiley-WMF) a:VRiley-WMF
[09:50:33] (CR) Slyngshede: [C:+2] 2FA: Implement recovery codes. [software/bitu] - https://gerrit.wikimedia.org/r/1059850 (owner: Slyngshede)
[09:51:05] (PS1) Jelto: gitlab: add option to serve a robots.txt and enable it in WMCS [puppet] - https://gerrit.wikimedia.org/r/1063758 (https://phabricator.wikimedia.org/T372538)
[09:51:56] (CR) Slyngshede: "No, not yet, but I'm open to suggestions. The actions are logged, but it might be a sensible idea to log these to a separate log file." [software/bitu] - https://gerrit.wikimedia.org/r/1060092 (https://phabricator.wikimedia.org/T359820) (owner: Slyngshede)
[09:51:59] (CR) Slyngshede: [C:+2] Wikimedia: New management command for blocking users in systems.
[software/bitu] - https://gerrit.wikimedia.org/r/1060092 (https://phabricator.wikimedia.org/T359820) (owner: Slyngshede)
[09:53:18] (CR) Btullis: [V:+1] "PCC SUCCESS (DIFF 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1034973 (https://phabricator.wikimedia.org/T330152) (owner: Btullis)
[09:53:25] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:53:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2136 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P67377 and previous config saved to /var/cache/conftool/dbconfig/20240819-095342-root.json
[09:55:22] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372336#10072579 (VRiley-WMF) Open→Resolved I have attempted to rebalance this power again. I'm hopeful that this should help with the errors.
[09:56:02] (Merged) jenkins-bot: Wikimedia: New management command for blocking users in systems. [software/bitu] - https://gerrit.wikimedia.org/r/1060092 (https://phabricator.wikimedia.org/T359820) (owner: Slyngshede)
[09:58:55] (CR) Jelto: [V:+1 C:+2] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3676/console" [puppet] - https://gerrit.wikimedia.org/r/1063004 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240819T1000)
[10:01:08] (CR) Filippo Giunchedi: [C:+1] "Nice!"
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1063031 (https://phabricator.wikimedia.org/T365265) (owner: 10Scott French) [10:05:46] (03CR) 10Kevin Bazira: [C:03+2] ml-services: huggingface from src dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063704 (https://phabricator.wikimedia.org/T369344) (owner: 10Kevin Bazira) [10:05:49] (03CR) 10Btullis: [V:03+1] "Adding WMCS engineers for visibility. This should be a no-op for any WMCS Ceph clusters, but it touches a shared template so I'll share ho" [puppet] - 10https://gerrit.wikimedia.org/r/1034973 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [10:07:33] (03Merged) 10jenkins-bot: ml-services: huggingface from src dir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063704 (https://phabricator.wikimedia.org/T369344) (owner: 10Kevin Bazira) [10:08:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2136 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P67378 and previous config saved to /var/cache/conftool/dbconfig/20240819-100847-root.json [10:10:13] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [10:10:44] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [10:14:08] !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[10:20:44] (03PS1) 10Filippo Giunchedi: prometheus: enable oidc auth [puppet] - 10https://gerrit.wikimedia.org/r/1063761 (https://phabricator.wikimedia.org/T326657) [10:21:05] (03CR) 10Ayounsi: [C:03+1] cumin: Remove parsoid from aliases [puppet] - 10https://gerrit.wikimedia.org/r/1063744 (https://phabricator.wikimedia.org/T359387) (owner: 10Clément Goubert) [10:21:48] (03CR) 10Clément Goubert: [C:03+2] cumin: Remove parsoid from aliases [puppet] - 10https://gerrit.wikimedia.org/r/1063744 (https://phabricator.wikimedia.org/T359387) (owner: 10Clément Goubert) [10:22:50] (03PS2) 10Filippo Giunchedi: prometheus: enable oidc auth [puppet] - 10https://gerrit.wikimedia.org/r/1063761 (https://phabricator.wikimedia.org/T326657) [10:23:14] (03CR) 10Filippo Giunchedi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1063761 (https://phabricator.wikimedia.org/T326657) (owner: 10Filippo Giunchedi) [10:25:06] (03PS1) 10Ayounsi: Interface validator: device_role -> role [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1063762 [10:25:34] (03PS3) 10Filippo Giunchedi: prometheus: enable oidc auth [puppet] - 10https://gerrit.wikimedia.org/r/1063761 (https://phabricator.wikimedia.org/T326657) [10:25:37] (03CR) 10Ayounsi: "Introduced in I8b372f6a08cc1f1c7c3bf570c77ef20f3fe6407c" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1063762 (owner: 10Ayounsi) [10:26:21] (03PS1) 10Ebrahim: Enable the dark mode in Portal namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 [10:27:06] (03PS2) 10Ebrahim: Enable the dark mode in Portal namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 [10:27:22] (03CR) 10Ayounsi: [C:03+2] Interface validator: device_role -> role [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1063762 (owner: 10Ayounsi) [10:28:32] (03PS3) 10Ebrahim: Enable the dark mode in Portal namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 [10:29:31] (03Merged) 10jenkins-bot: 
Interface validator: device_role -> role [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1063762 (owner: 10Ayounsi) [10:29:55] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [10:30:26] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [10:31:50] (03PS4) 10Ebrahim: Enable the dark mode in Portal namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 [10:33:37] (03PS1) 10Filippo Giunchedi: prometheus: align prometheus and prometheus::pop role hiera [puppet] - 10https://gerrit.wikimedia.org/r/1063764 [10:33:53] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] prometheus: align prometheus and prometheus::pop role hiera [puppet] - 10https://gerrit.wikimedia.org/r/1063764 (owner: 10Filippo Giunchedi) [10:36:47] (03CR) 10Kevin Bazira: [C:03+2] ml-services: set API_PREFIX in rec-api [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063729 (https://phabricator.wikimedia.org/T371465) (owner: 10Kevin Bazira) [10:38:03] (03CR) 10Clément Goubert: Remove parsoid-php from MEDIAWIKI_SERVICES (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1063261 (https://phabricator.wikimedia.org/T359387) (owner: 10Scott French) [10:38:05] (03Merged) 10jenkins-bot: ml-services: set API_PREFIX in rec-api [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063729 (https://phabricator.wikimedia.org/T371465) (owner: 10Kevin Bazira) [10:41:40] (03CR) 10Brouberol: Add radosgw services to the cephosd servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1034973 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [10:42:15] (03CR) 10Slyngshede: [C:03+2] C:idm::deployment Add Bitu 2FA dependencies. 
[puppet] - 10https://gerrit.wikimedia.org/r/1063736 (owner: 10Slyngshede) [10:45:38] jouncebot: nowandnext [10:45:38] For the next 0 hour(s) and 14 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240819T1000) [10:45:38] In 2 hour(s) and 14 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240819T1300) [10:48:17] (03PS1) 10AOkoth: sql_exporter: provide a way to specify metric type [puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) [10:49:00] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:49:11] (03CR) 10Btullis: [V:03+1] Add radosgw services to the cephosd servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1034973 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [10:49:16] (03PS2) 10AOkoth: sql_exporter: provide a way to specify metric type [puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) [10:52:33] (03PS3) 10AOkoth: sql_exporter: provide a way to specify metric type [puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) [10:53:48] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:54:53] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/1063766/3678/" [puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [10:57:24] (03CR) 10Clément Goubert: [C:03+1] mw-debug: pilot bookworm statsd exporter image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063032 (https://phabricator.wikimedia.org/T368366) (owner: 10Scott French) [11:00:01] (03PS1) 10Brouberol: cloudnative-pg: 
configure the operator to watch the airflow-test-k8s ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063769 (https://phabricator.wikimedia.org/T372286) [11:01:03] (03PS3) 10Hnowlan: rpc: add script for running jobs from stdin rather than http [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059394 (https://phabricator.wikimedia.org/T369048) [11:01:51] (03CR) 10Hnowlan: "I think just vanilla `scripts` works a little better - not sure why but reusing the `wmf-scripts` pattern for something that is quite outs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059394 (https://phabricator.wikimedia.org/T369048) (owner: 10Hnowlan) [11:04:08] (03CR) 10Stevemunene: [C:03+1] cloudnative-pg: configure the operator to watch the airflow-test-k8s ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063769 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol) [11:05:35] (03PS5) 10Ebrahim: Enable the dark mode in Portal namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 [11:07:37] (03CR) 10Btullis: [C:03+1] cloudnative-pg: configure the operator to watch the airflow-test-k8s ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063769 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol) [11:11:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:13:20] (03CR) 10Hnowlan: [C:03+1] sre.k8s: Add pool-depool-node cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1059045 (owner: 10Clément Goubert) [11:15:24] (03PS3) 10Hnowlan: thumbor: add allowlist to thumbor to address internal rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063228 (https://phabricator.wikimedia.org/T372470) [11:17:19] (03PS6) 10Ebrahim: Enable the dark mode in Portal namespace [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1063763 [11:19:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#10072826 (10VRiley-WMF) [11:25:36] (03PS1) 10Slyngshede: Properties page: Move 2FA button out from editable fields. [software/bitu] - 10https://gerrit.wikimedia.org/r/1063774 [11:26:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10072829 (10ayounsi) @Jclark-ctr @Clement_Goubert fixed the typo, so you should be good to go. We also got this alert: ` FAIL: cumin-check-aliases Syst... [11:27:11] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [11:27:58] (03CR) 10Slyngshede: [C:03+2] Properties page: Move 2FA button out from editable fields. [software/bitu] - 10https://gerrit.wikimedia.org/r/1063774 (owner: 10Slyngshede) [11:29:07] (03PS7) 10AOkoth: sql_exporter: specify column for metric [puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) [11:29:07] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/1063766/3679/" [puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [11:30:14] !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [11:30:31] (03Merged) 10jenkins-bot: Properties page: Move 2FA button out from editable fields. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1063774 (owner: 10Slyngshede) [11:32:34] jouncebot: nowandnext [11:32:34] No deployments scheduled for the next 1 hour(s) and 27 minute(s) [11:32:34] In 1 hour(s) and 27 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240819T1300) [11:34:42] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd100[1-4] - https://phabricator.wikimedia.org/T370546#10072837 (10VRiley-WMF) [11:41:39] (03PS4) 10Jgiannelos: mobileapps: Configure caching for production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063765 (https://phabricator.wikimedia.org/T319365) [11:43:16] (03CR) 10Kamila Součková: [C:03+1] "LGTM, thank you for indulging my questions :D" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1053911 (https://phabricator.wikimedia.org/T368251) (owner: 10JMeybohm) [11:49:33] !log Restarted MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration [11:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:25] !log Started scanning script for ruwiki with timeout of 6h to catchup to monthly request limit [11:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:26] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:01:15] (03PS1) 10Dreamy Jazz: Enable temporary accounts on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063776 (https://phabricator.wikimedia.org/T371116) [12:01:23] Going to deploy the above now [12:01:51] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[12:02:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063776 (https://phabricator.wikimedia.org/T371116) (owner: 10Dreamy Jazz) [12:02:59] (03Merged) 10jenkins-bot: Enable temporary accounts on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063776 (https://phabricator.wikimedia.org/T371116) (owner: 10Dreamy Jazz) [12:03:21] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1063776|Enable temporary accounts on test2wiki (T371116)]] [12:03:31] !log kevinbazira@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [12:03:33] T371116: Deploy temporary accounts to test2wiki - https://phabricator.wikimedia.org/T371116 [12:09:22] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/1063766/3679/" [puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [12:09:50] (03CR) 10Filippo Giunchedi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1063761 (https://phabricator.wikimedia.org/T326657) (owner: 10Filippo Giunchedi) [12:11:44] !log pfischer@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:12:26] (03CR) 10Filippo Giunchedi: [C:03+1] mariadb: tweaks monitoring thresholds for replication lag (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1054893 (https://phabricator.wikimedia.org/T367279) (owner: 10Arnaudb) [12:15:00] !log pfischer@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:15:32] (03CR) 10Brouberol: [C:03+2] cloudnative-pg: configure the operator to watch the airflow-test-k8s ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063769 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol) [12:16:41] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1063776|Enable temporary 
accounts on test2wiki (T371116)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:16:47] (03CR) 10Brouberol: "Thanks for the explanations!" [puppet] - 10https://gerrit.wikimedia.org/r/1034973 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [12:16:49] T371116: Deploy temporary accounts to test2wiki - https://phabricator.wikimedia.org/T371116 [12:17:39] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:17:50] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:18:28] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [12:18:48] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:19:22] (03CR) 10Filippo Giunchedi: [C:03+1] Prometheus: Add recording rules computing commonly used envoy histograms [puppet] - 10https://gerrit.wikimedia.org/r/1055432 (https://phabricator.wikimedia.org/T369607) (owner: 10JMeybohm) [12:21:04] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:21:15] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063791 [12:23:26] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:25:36] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1063776|Enable temporary accounts on test2wiki (T371116)]] (duration: 22m 14s) [12:25:42] T371116: Deploy temporary accounts to test2wiki - https://phabricator.wikimedia.org/T371116 [12:26:41] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1015.eqiad.wmnet,service=s4 [12:26:46] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1015.eqiad.wmnet,service=s6 [12:27:09] (03PS1) 10Btullis: Update the public key used by Ifrahkhanyaree_WMDE [puppet] - 
10https://gerrit.wikimedia.org/r/1063800 (https://phabricator.wikimedia.org/T371894) [12:27:34] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1015.eqiad.wmnet with reason: Reimaging clouddb1015 T365424 [12:27:47] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424 [12:27:47] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1015.eqiad.wmnet with reason: Reimaging clouddb1015 T365424 [12:27:51] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:31:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T367856)', diff saved to https://phabricator.wikimedia.org/P67382 and previous config saved to /var/cache/conftool/dbconfig/20240819-123119-marostegui.json [12:31:22] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [12:32:24] (03CR) 10EoghanGaffney: [C:03+2] sre.gitlab.upgrade: also use the service name for the downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto) [12:32:30] (03CR) 10EoghanGaffney: [C:03+1] sre.gitlab.upgrade: also use the service name for the downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto) [12:32:37] (03CR) 10Slyngshede: [C:03+1] "LGTM, but please verify out of band with the user that this is in fact their key." 
[puppet] - 10https://gerrit.wikimedia.org/r/1063800 (https://phabricator.wikimedia.org/T371894) (owner: 10Btullis) [12:33:06] !log fnegri@cumin1002 START - Cookbook sre.hosts.reimage for host clouddb1015.eqiad.wmnet with OS bookworm [12:33:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:34:24] (03CR) 10Vgutierrez: "a quick check on codesearch shows that apparently `X-CS` isn't being used anymore by mediawiki https://codesearch.wmcloud.org/search/?q=X-" [puppet] - 10https://gerrit.wikimedia.org/r/1063069 (https://phabricator.wikimedia.org/T370200) (owner: 10BCornwall) [12:37:00] !log pfischer@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:37:16] !log pfischer@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:37:20] (03PS4) 10Filippo Giunchedi: prometheus: enable oidc auth [puppet] - 10https://gerrit.wikimedia.org/r/1063761 (https://phabricator.wikimedia.org/T326657) [12:37:43] !log pfischer@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:38:01] !log pfischer@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:38:18] !log pfischer@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:38:43] (03PS1) 10Brouberol: cloudnative-pg-cluster: enable ingress traffic to the PG pod API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063803 (https://phabricator.wikimedia.org/T372286) [12:39:09] (03PS2) 10Brouberol: cloudnative-pg-cluster: enable ingress traffic to the PG pod API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063803 (https://phabricator.wikimedia.org/T372286) [12:39:16] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [12:39:44] !log pfischer@deploy1003 helmfile 
[eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:39:55] (03CR) 10Filippo Giunchedi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1063761 (https://phabricator.wikimedia.org/T326657) (owner: 10Filippo Giunchedi) [12:41:32] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:45:54] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb1015.eqiad.wmnet with reason: host reimage [12:45:59] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/output/1063761/4180/" [puppet] - 10https://gerrit.wikimedia.org/r/1063761 (https://phabricator.wikimedia.org/T326657) (owner: 10Filippo Giunchedi) [12:46:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P67383 and previous config saved to /var/cache/conftool/dbconfig/20240819-124625-marostegui.json [12:49:15] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb1015.eqiad.wmnet with reason: host reimage [12:49:59] (03PS3) 10Slyngshede: PermissionRequest validation. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1060812 [12:51:04] (03PS13) 10Arnaudb: mariadb: Tweak monitoring thresholds for replication lag [alerts] - 10https://gerrit.wikimedia.org/r/1054893 (https://phabricator.wikimedia.org/T367279) [12:51:23] (03CR) 10Arnaudb: mariadb: Tweak monitoring thresholds for replication lag (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1054893 (https://phabricator.wikimedia.org/T367279) (owner: 10Arnaudb) [12:51:56] (03PS2) 10Dreamy Jazz: Define wgVirtualDomainsMapping for virtual-checkuser-global [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059422 (https://phabricator.wikimedia.org/T371724) [12:52:07] jouncebot: nowandnext [12:52:07] No deployments scheduled for the next 0 hour(s) and 7 minute(s) [12:52:08] In 0 hour(s) and 7 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240819T1300) [12:52:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 19 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059422 (https://phabricator.wikimedia.org/T371724) (owner: 10Dreamy Jazz) [12:52:51] (03PS2) 10Klausman: services/apigw: drop prefix trim for recommendation-api [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063805 (https://phabricator.wikimedia.org/T347263) [12:53:05] (03CR) 10Arnaudb: [C:03+2] mariadb: Tweak monitoring thresholds for replication lag [alerts] - 10https://gerrit.wikimedia.org/r/1054893 (https://phabricator.wikimedia.org/T367279) (owner: 10Arnaudb) [12:54:16] (03Merged) 10jenkins-bot: mariadb: Tweak monitoring thresholds for replication lag [alerts] - 10https://gerrit.wikimedia.org/r/1054893 (https://phabricator.wikimedia.org/T367279) (owner: 10Arnaudb) [12:54:47] (03CR) 10Filippo Giunchedi: [C:04-1] "I think the "metrics" data structure should support explicit fields like these, please do not overload 
"query" with column names" [puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [12:56:17] (03CR) 10Btullis: [C:03+1] "Looks good." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063803 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol) [12:56:50] (03CR) 10Brouberol: [C:03+2] cloudnative-pg-cluster: enable ingress traffic to the PG pod API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063803 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol) [12:57:47] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [13:00:04] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240819T1300). [13:00:05] MatmaRex, hnowlan, and Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:13] o/ [13:00:18] hi [13:00:43] my maintenance script will take a few hours in total, please run it in a `screen` or something like that [13:00:44] I can deploy! [13:00:50] ack [13:00:59] MatmaRex: should I start it now or wait for the other deployments first? 
[13:00:59] (a few minutes per wiki) [13:01:07] my guess would be it’s not affected either way [13:01:08] o/ [13:01:10] Lucas_WMDE: you can start it [13:01:14] ok [13:01:25] my change will only take effect once it hits prod as it affects the jobrunners only [13:01:27] (03CR) 10Kevin Bazira: [C:03+1] services/apigw: drop prefix trim for recommendation-api [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063805 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman) [13:01:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P67384 and previous config saved to /var/cache/conftool/dbconfig/20240819-130132-marostegui.json [13:01:47] (03CR) 10Btullis: [C:03+2] "Thanks. I have verified through Slack DM chat that this key belongs to the user." [puppet] - 10https://gerrit.wikimedia.org/r/1063800 (https://phabricator.wikimedia.org/T371894) (owner: 10Btullis) [13:02:07] !log START lucaswerkmeister-wmde@mwmaint1002:~$ foreachwiki maintenance/cleanupTitles.php --prefix=T195546 --reporting-interval=1000000000 2>&1 | tee ~/T195546.log [13:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:14] T195546: Run the maintenance script cleanupTitles.php on all wikis to rescue currently-inaccessible pages - https://phabricator.wikimedia.org/T195546 [13:02:22] in a tmux session named T195546 on said host [13:02:47] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:03:03] thanks! 
[13:03:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062979 (https://phabricator.wikimedia.org/T369048) (owner: 10Hnowlan) [13:03:21] (03PS2) 10Hnowlan: (de|uk|ja|he|fi)wiki: enable shellbox-video [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062979 (https://phabricator.wikimedia.org/T369048) [13:03:31] (03CR) 10EoghanGaffney: vrts: run install script on new server (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1063733 (owner: 10AOkoth) [13:03:43] \o [13:03:48] hnowlan: did you attach the right task to that change? [13:03:59] to me it looks related but not quite like the task I would’ve expected [13:04:12] My change should be a no-op, so there will be nothing to test. [13:05:00] you could still test that it’s not breaking anything ;) [13:05:21] Sure :D [13:05:46] Lucas_WMDE: good point - I'll update that [13:06:04] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Define wgVirtualDomainsMapping for virtual-checkuser-global (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059422 (https://phabricator.wikimedia.org/T371724) (owner: 10Dreamy Jazz) [13:06:19] ok, then I’ll do Dreamy_Jazz’s change in the meantime [13:06:31] (03PS3) 10Hnowlan: (de|uk|ja|he|fi)wiki: enable shellbox-video [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062979 (https://phabricator.wikimedia.org/T356241) [13:06:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059422 (https://phabricator.wikimedia.org/T371724) (owner: 10Dreamy Jazz) [13:07:15] (03CR) 10Dreamy Jazz: Define wgVirtualDomainsMapping for virtual-checkuser-global (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059422 (https://phabricator.wikimedia.org/T371724) (owner: 10Dreamy Jazz) [13:07:18] (03Merged) 10jenkins-bot: Define wgVirtualDomainsMapping 
for virtual-checkuser-global [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059422 (https://phabricator.wikimedia.org/T371724) (owner: 10Dreamy Jazz) [13:07:28] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1059422|Define wgVirtualDomainsMapping for virtual-checkuser-global (T371724)]] [13:07:31] T371724: Define virtual domains configuration for virtual-checkuser-global on WMF wikis - https://phabricator.wikimedia.org/T371724 [13:08:53] (03PS4) 10Hnowlan: (de|uk|ja|he|fi)wiki: enable shellbox-video [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062979 (https://phabricator.wikimedia.org/T356241) [13:09:35] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "rdb1014 back to active - cgoubert@cumin1002 - T370633" [13:09:38] T370633: hw troubleshooting: CPU 2 machine check error detected for rdb1014.eqiad.wmnet - https://phabricator.wikimedia.org/T370633 [13:09:54] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "rdb1014 back to active - cgoubert@cumin1002 - T370633" [13:10:00] hm, I appear to have lost logstash access? [13:10:02] > Service access denied due to missing privileges. [13:10:03] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [13:10:06] Hmm [13:10:07] (on idp.w.o) [13:10:13] * Lucas_WMDE tries grafana [13:10:25] !log lucaswerkmeister-wmde@deploy1003 dreamyjazz, lucaswerkmeister-wmde: Backport for [[gerrit:1059422|Define wgVirtualDomainsMapping for virtual-checkuser-global (T371724)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:10:30] same there [13:10:32] not great [13:10:41] Odd [13:10:42] o_O apparently I can type bold in this client. didn’t even realize [13:10:48] guess I’ll make a task for mah access [13:11:07] in the meantime, Dreamy_Jazz want to quickly check that nothing broke on mwdebug? ^^ [13:11:12] Sure. 
[13:11:18] MatmaRex: maint script currently at bnwiki btw [13:11:28] nice [13:11:35] (03PS7) 10Ebrahim: Enable the dark mode in Portal namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063763 [13:12:20] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:12:59] 10SRE-Access-Requests: Lost access to Logstash, Grafana, probably more (idp.wikimedia.org) - https://phabricator.wikimedia.org/T372762 (10Lucas_Werkmeister_WMDE) 03NEW [13:13:08] filed ^ for my access problems [13:13:09] Lucas_WMDE: Editing still works, so I think that should be good. The config isn't used anywhere, so probably should be enough to test editing. [13:13:14] !log lucaswerkmeister-wmde@deploy1003 dreamyjazz, lucaswerkmeister-wmde: Continuing with sync [13:13:17] ok, thanks! [13:13:43] Also checked the mwdebug logs on logstash to double check [13:13:44] 10SRE-Access-Requests: Lost access to Logstash, Grafana, probably more (idp.wikimedia.org) - https://phabricator.wikimedia.org/T372762#10073146 (10Lucas_Werkmeister_WMDE) [13:13:56] thanks, I can’t do that at the moment :D [13:16:15] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddb1015.eqiad.wmnet with OS bookworm [13:16:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T367856)', diff saved to https://phabricator.wikimedia.org/P67385 and previous config saved to /var/cache/conftool/dbconfig/20240819-131640-marostegui.json [13:16:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 7:00:00 on db2162.codfw.wmnet with reason: Maintenance [13:16:44] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [13:16:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 7:00:00 on db2162.codfw.wmnet with reason: Maintenance [13:17:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2162 (T367856)', diff saved to 
https://phabricator.wikimedia.org/P67386 and previous config saved to /var/cache/conftool/dbconfig/20240819-131702-marostegui.json [13:17:16] wait [13:17:21] am I an idiot? [13:17:46] I am an idiot [13:17:51] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1059422|Define wgVirtualDomainsMapping for virtual-checkuser-global (T371724)]] (duration: 10m 23s) [13:17:54] T371724: Define virtual domains configuration for virtual-checkuser-global on WMF wikis - https://phabricator.wikimedia.org/T371724 [13:18:19] Thanks for the deploy [13:18:47] (T372762 resolved, I was just using the wrong account >.<) [13:18:47] T372762: Lost access to Logstash, Grafana, probably more (idp.wikimedia.org) - https://phabricator.wikimedia.org/T372762 [13:18:48] 10SRE-Access-Requests: Lost access to Logstash, Grafana, probably more (idp.wikimedia.org) - https://phabricator.wikimedia.org/T372762#10073176 (10Lucas_Werkmeister_WMDE) 05Open→03Invalid I had the wrong password manager tab open and was trying to log in as @LucasWerkmeister instead of @Lucas_Werkmeister... 
[13:18:49] Dreamy_Jazz: np [13:19:20] (03CR) 10TrainBranchBot: "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062979 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [13:20:08] (03Merged) 10jenkins-bot: (de|uk|ja|he|fi)wiki: enable shellbox-video [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062979 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [13:20:19] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1062979|(de|uk|ja|he|fi)wiki: enable shellbox-video (T356241)]] [13:20:23] T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241 [13:21:04] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:22:15] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, hnowlan: Backport for [[gerrit:1062979|(de|uk|ja|he|fi)wiki: enable shellbox-video (T356241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:22:23] hnowlan: anything to test on mwdebug? 
[13:22:38] Lucas_WMDE: sadly no [13:22:46] oh right you mentioned jobrunners didn’t you [13:22:47] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, hnowlan: Continuing with sync [13:22:50] I forgot ^^ [13:23:02] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1015.eqiad.wmnet,service=s6 [13:23:08] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1015.eqiad.wmnet,service=s4 [13:24:24] (03PS1) 10Brouberol: cloudnative-pg-cluster: enable ingress from the join pod to PG/5432 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063808 (https://phabricator.wikimedia.org/T372286) [13:27:16] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1062979|(de|uk|ja|he|fi)wiki: enable shellbox-video (T356241)]] (duration: 06m 57s) [13:27:29] T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241 [13:27:35] hnowlan: should be done [13:28:28] (cleanupTitles is now “processing page...” of commonswiki, which I hear is one of the larger wikis) [13:28:56] Lucas_WMDE: thank you! 
[13:31:09] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:31:52] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:32:49] FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:34:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:34:18] !log UTC afternoon backport+config window done (except for the T195546 maintenance script which is expected to keep running for a few more hours, currently at commonswiki) [13:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:21] T195546: Run the maintenance script cleanupTitles.php on all wikis to rescue currently-inaccessible pages - https://phabricator.wikimedia.org/T195546 [13:34:57] hnowlan: ^ AppserversUnreachable [13:35:27] ack, ty [13:38:45] I'm not sure I understand this. 
Looks like they're still up [13:39:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:39:29] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767 (10karapayneWMDE) 03NEW [13:41:58] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10073261 (10Jhancock.wm) [13:44:16] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1063761 (https://phabricator.wikimedia.org/T326657) (owner: 10Filippo Giunchedi) [13:51:05] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Puppet 8 readiness - https://phabricator.wikimedia.org/T366900#10073302 (10jhathaway) [13:53:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:57:57] (03PS1) 10Brouberol: airflow: initialize the DB via a hook, at first install [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063814 (https://phabricator.wikimedia.org/T372286) [13:58:26] (03PS3) 10Andrea Denisse: alert: Ensure alert1002 is the active alert host [puppet] - 10https://gerrit.wikimedia.org/r/1063075 (https://phabricator.wikimedia.org/T372418) [13:58:26] (03PS2) 10Andrea Denisse: alert: Remove the alert[12]002 hosts as alertmanagers [puppet] - 10https://gerrit.wikimedia.org/r/1063234 (https://phabricator.wikimedia.org/T372607) [13:58:27] 
10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#10073333 (10Papaul) @Dwisehaupt all working thank you. [13:58:41] (03PS1) 10Hnowlan: Revert "changeprop-jobqueue: reduce refreshLinks concurrency" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063815 [13:58:54] (03CR) 10CI reject: [V:04-1] airflow: initialize the DB via a hook, at first install [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063814 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol) [13:59:13] (03CR) 10Andrea Denisse: alert: Ensure alert1002 is the active alert host (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1063075 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [14:01:07] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063815 (owner: 10Hnowlan) [14:03:37] (03PS2) 10Brouberol: airflow: initialize the DB via a hook, at first install [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063814 (https://phabricator.wikimedia.org/T372286) [14:04:09] (03CR) 10Btullis: [C:03+2] gobblin: remove webrequest_frontend ingestion job. [puppet] - 10https://gerrit.wikimedia.org/r/1062671 (https://phabricator.wikimedia.org/T372456) (owner: 10Gmodena) [14:07:15] (03PS1) 10Btullis: Revert "gobblin: remove webrequest_frontend ingestion job." 
[puppet] - 10https://gerrit.wikimedia.org/r/1063816 [14:08:50] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: test I946dd0b73b6be2d6b8093f03550f78d76188b92b with dummy upgrade [14:09:44] !log jelto@cumin1002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: test I946dd0b73b6be2d6b8093f03550f78d76188b92b with dummy upgrade [14:10:39] (03CR) 10Jelto: [C:04-1] "this fails with "spicerack.icinga.IcingaError: Host gitlab-replica-b was not found in Icinga status - no hosts have been downtimed." when " [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto) [14:12:19] (03CR) 10Btullis: [C:03+2] Revert "gobblin: remove webrequest_frontend ingestion job." [puppet] - 10https://gerrit.wikimedia.org/r/1063816 (owner: 10Btullis) [14:12:57] commons is fixing pages like 106051803 [14:13:03] (03CR) 10Bking: [V:03+1 C:03+1] airflow: initialize the DB via a hook, at first install [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063814 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol) [14:13:05] * Lucas_WMDE checks Special:NewPages for how many digits current page IDs have [14:13:32] ok, same number of digits starting with 15…, so commons is about ⅔ done [14:15:31] (03CR) 10Bking: [C:03+1] airflow: initialize the DB via a hook, at first install [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063814 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol) [14:18:57] (03PS1) 10Btullis: Absent the webrequest_frontend_rc0 gobblin job [puppet] - 10https://gerrit.wikimedia.org/r/1063819 (https://phabricator.wikimedia.org/T372456) [14:18:59] (03PS1) 10Btullis: Remove the webrequest_frontend_rc0 gobblin job [puppet] - 10https://gerrit.wikimedia.org/r/1063820 (https://phabricator.wikimedia.org/T372456) [14:21:50] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: 
Exclude legacy facts by default - https://phabricator.wikimedia.org/T372666#10073457 (10jhathaway) p:05Triage→03Low [14:22:00] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Drop PSON support - https://phabricator.wikimedia.org/T372667#10073458 (10jhathaway) p:05Triage→03Low [14:22:09] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Strict mode enabled by default - https://phabricator.wikimedia.org/T372664#10073460 (10jhathaway) p:05Triage→03Low [14:22:29] 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: BGP status (instance cr1-esams) - https://phabricator.wikimedia.org/T372248#10073461 (10ayounsi) p:05Triage→03Low [14:22:38] (03CR) 10Brouberol: [C:03+2] airflow: initialize the DB via a hook, at first install [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063814 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol) [14:22:41] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Puppet 8 readiness - https://phabricator.wikimedia.org/T366900#10073462 (10jhathaway) a:03jhathaway [14:22:58] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Strict mode enabled by default - https://phabricator.wikimedia.org/T372664#10073467 (10jhathaway) p:05Low→03Medium [14:23:00] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Drop PSON support - https://phabricator.wikimedia.org/T372667#10073468 (10jhathaway) p:05Low→03Medium [14:23:13] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Exclude legacy facts by default - https://phabricator.wikimedia.org/T372666#10073482 (10jhathaway) p:05Low→03Medium [14:23:48] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:27:42] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 5:00:00 on 
wdqs[1022,1024].eqiad.wmnet with reason: noisy alerts, will look at later in the day [14:27:42] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: enable oidc auth [puppet] - 10https://gerrit.wikimedia.org/r/1063761 (https://phabricator.wikimedia.org/T326657) (owner: 10Filippo Giunchedi) [14:27:59] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on wdqs[1022,1024].eqiad.wmnet with reason: noisy alerts, will look at later in the day [14:28:25] (03CR) 10Hnowlan: [C:03+2] Revert "changeprop-jobqueue: reduce refreshLinks concurrency" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063815 (owner: 10Hnowlan) [14:29:22] (03Merged) 10jenkins-bot: Revert "changeprop-jobqueue: reduce refreshLinks concurrency" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063815 (owner: 10Hnowlan) [14:29:29] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1300.eqiad.wmnet with OS bullseye [14:29:31] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1301.eqiad.wmnet with OS bullseye [14:29:33] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1303.eqiad.wmnet with OS bullseye [14:29:46] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1300.eqiad.wmnet with OS bullseye [14:29:48] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1301.eqiad.wmnet with OS bullseye [14:29:50] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1303.eqiad.wmnet with OS bullseye [14:30:10] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: sre.hosts.reimage failing due to mkfs.ext4 taking to long - https://phabricator.wikimedia.org/T372648#10073512 (10SLyngshede-WMF) p:05Triage→03Medium a:03SLyngshede-WMF It's probably enough to bump the default timeout as a quick fix. I'll take a l... 
[14:30:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10073515 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1300.eqiad.wmnet with OS bull... [14:30:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10073516 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1301.eqiad.wmnet with OS bull... [14:30:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10073518 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1303.eqiad.wmnet with OS bull... [14:30:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10073519 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1300.eqiad.wmnet with OS bullseye... [14:30:27] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10073520 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1301.eqiad.wmnet with OS bullseye... [14:30:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10073521 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1303.eqiad.wmnet with OS bullseye... 
[14:30:38] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:30:42] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:30:52] (03PS8) 10AOkoth: sql_exporter: specify column for metric [puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) [14:30:55] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1300.eqiad.wmnet with OS bullseye [14:30:56] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1301.eqiad.wmnet with OS bullseye [14:30:57] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1303.eqiad.wmnet with OS bullseye [14:31:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10073536 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1300.eqiad.wmnet with OS bull... [14:31:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10073537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1301.eqiad.wmnet with OS bull... [14:31:15] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1302.eqiad.wmnet with OS bullseye [14:31:15] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10073538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1303.eqiad.wmnet with OS bull... 
[14:31:16] (03CR) 10CI reject: [V:04-1] sql_exporter: specify column for metric [puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [14:31:22] (03CR) 10AOkoth: "Acknowledged." [puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [14:31:25] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1304.eqiad.wmnet with OS bullseye [14:31:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10073540 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1302.eqiad.wmnet with OS bull... [14:31:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10073542 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1304.eqiad.wmnet with OS bull... 
[14:32:08] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [14:32:15] (03PS9) 10AOkoth: sql_exporter: specify column for metric [puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) [14:32:38] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [14:32:39] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [14:32:49] (03CR) 10CI reject: [V:04-1] sql_exporter: specify column for metric [puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [14:33:26] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [14:35:10] (03CR) 10Btullis: [C:03+2] Absent the webrequest_frontend_rc0 gobblin job [puppet] - 10https://gerrit.wikimedia.org/r/1063819 (https://phabricator.wikimedia.org/T372456) (owner: 10Btullis) [14:36:11] cleanupTitles made it through commonswiki \o/ [14:37:09] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:37:24] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:39:26] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:43:24] 07sre-alert-triage, 06Infrastructure-Foundations, 10netops: Alert in need of triage: BGP status (instance cr1-esams) - https://phabricator.wikimedia.org/T372248#10073647 (10ayounsi) 05Open→03Resolved Peer removed. 
[14:44:39] (03PS1) 10Filippo Giunchedi: Revert "prometheus: enable oidc auth" [puppet] - 10https://gerrit.wikimedia.org/r/1063821 [14:44:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10073649 (10VRiley-WMF) Worked with Dell on 1298. They have determined it will require a motherboard swap. Parts will be coming tomorrow. (Dell ticket 196... [14:45:05] (03CR) 10CI reject: [V:04-1] Revert "prometheus: enable oidc auth" [puppet] - 10https://gerrit.wikimedia.org/r/1063821 (owner: 10Filippo Giunchedi) [14:45:11] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] Revert "prometheus: enable oidc auth" [puppet] - 10https://gerrit.wikimedia.org/r/1063821 (owner: 10Filippo Giunchedi) [14:46:23] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk failed on ms-be1079 - https://phabricator.wikimedia.org/T372560#10073656 (10VRiley-WMF) Worked with Dell on this, they will be shipping out a new HDD which will arrive tomorrow. (Dell ticket 196124764) [14:47:22] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr1-eqiad: disk failure - https://phabricator.wikimedia.org/T372781 (10ayounsi) 03NEW p:05Triage→03High [14:47:32] MatmaRex: on enwiki the script is printing stuff like “Couldn't legalize; form 'T195546/User_talk:195.175.37.8' exists; using 'T195546/id: 11429818'” [14:47:33] T195546: Run the maintenance script cleanupTitles.php on all wikis to rescue currently-inaccessible pages - https://phabricator.wikimedia.org/T195546 [14:47:39] maybe we shouldn’t have run it there a second time? [14:48:19] Lucas_WMDE: what do you mean by "second time"? i don't think it was run [14:48:27] (03PS10) 10AOkoth: sql_exporter: specify column for metric [puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) [14:48:29] didn’t I run the script on enwiki a few days ago? 
[14:48:33] or was it a different one [14:48:34] those messages seem expected to me [14:48:39] Lucas_WMDE: no, it was hewikisource IIRC [14:48:46] 10ops-codfw, 06DC-Ops: asw-c7-codfw: PEM 0 is not powered - https://phabricator.wikimedia.org/T372782 (10ayounsi) 03NEW p:05Triage→03High [14:48:52] which had 40 thousand of those pages for some reason [14:48:56] ah, hewikisource indeed https://sal.toolforge.org/log/-ivUS5EBKFqumxvtjeB8 [14:48:58] huh [14:49:00] but other wikis have like 10 each (not thousands) [14:49:11] yeah it’s not much output [14:49:22] but I didn’t see “Couldn't legalize” in the output for any other wiki [14:49:26] FIRING: ProbeDown: Service prometheus1005:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#prometheus1005:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:49:34] so I guess I just hallucinated which other wiki the script was already run on [14:49:37] i think it's expected, just an unusual scenario [14:49:43] bc why else would the form with T195546 already exist [14:49:43] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1300.eqiad.wmnet with reason: host reimage [14:49:43] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1304.eqiad.wmnet with reason: host reimage [14:49:50] but ok :) [14:49:53] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1303.eqiad.wmnet with reason: host reimage [14:49:54] I’ll post the full log once it’s done [14:49:56] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1302.eqiad.wmnet with reason: host reimage [14:49:57] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1301.eqiad.wmnet with reason: host reimage [14:50:00] hmm [14:50:08] (03PS1) 
10Brouberol: airflow: add missing spec and volumes to the initdb job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063823 (https://phabricator.wikimedia.org/T372286) [14:50:14] maybe the message is wrong, worth looking into after [14:51:23] Lucas_WMDE: there may have been two invalid titles that are cleaned up into the same title [14:51:47] * Lucas_WMDE looks [14:52:01] yeah one with .008 and one with .8 in the IP it looks like [14:52:23] actually, with .08 too [14:52:25] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1300.eqiad.wmnet with reason: host reimage [14:52:57] https://paste.toolforge.org/view/1f04f23e if you’re curious [14:54:26] FIRING: [3x] ProbeDown: Service prometheus1005:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:55:03] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1303.eqiad.wmnet with reason: host reimage [14:55:43] FIRING: [4x] ProbeDown: Service prometheus1005:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:55:47] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logging-sd2001'] [14:55:47] (03CR) 10Bking: [C:03+1] airflow: add missing spec and volumes to the initdb job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063823 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol) [14:55:51] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logging-sd2003'] [14:56:12] (03PS1) 10Btullis: Disable wiping LVM signatures on cephosd server reimages 
[puppet] - 10https://gerrit.wikimedia.org/r/1063824 (https://phabricator.wikimedia.org/T372783) [14:57:31] (03CR) 10Brouberol: [C:03+2] airflow: add missing spec and volumes to the initdb job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063823 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol) [14:57:42] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1304.eqiad.wmnet with reason: host reimage [14:58:01] prometheus is me btw [14:59:26] RESOLVED: [4x] ProbeDown: Service prometheus1005:443 has failed probes (http_prometheus_eqiad_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:00:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1302.eqiad.wmnet with reason: host reimage [15:02:11] (03PS2) 10Scott French: Remove parsoid-php from MEDIAWIKI_SERVICES [cookbooks] - 10https://gerrit.wikimedia.org/r/1063261 (https://phabricator.wikimedia.org/T359387) [15:02:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['logging-sd2003'] [15:02:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['logging-sd2001'] [15:04:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1301.eqiad.wmnet with reason: host reimage [15:05:09] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:08:08] (03CR) 10Scott French: "Thanks for the review!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1063261 (https://phabricator.wikimedia.org/T359387) (owner: 10Scott French) [15:08:50] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:11:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:12:16] (03PS1) 10Brouberol: airflow: run the initdb hook post-installation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063829 [15:13:18] (03CR) 10Btullis: [C:03+1] airflow: run the initdb hook post-installation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063829 (owner: 10Brouberol) [15:13:38] (03CR) 10Brouberol: [C:03+2] airflow: run the initdb hook post-installation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063829 (owner: 10Brouberol) [15:13:41] (03CR) 10CI reject: [V:04-1] Remove parsoid-php from MEDIAWIKI_SERVICES [cookbooks] - 10https://gerrit.wikimedia.org/r/1063261 (https://phabricator.wikimedia.org/T359387) (owner: 10Scott French) [15:13:57] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:16:26] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:17:30] (03PS3) 10Scott French: Remove parsoid-php from MEDIAWIKI_SERVICES [cookbooks] - 10https://gerrit.wikimedia.org/r/1063261 (https://phabricator.wikimedia.org/T359387) [15:18:56] (03CR) 10BCornwall: "I too wasn't able to find anything. I'll ask a few others to spot check me." 
[puppet] - 10https://gerrit.wikimedia.org/r/1063069 (https://phabricator.wikimedia.org/T370200) (owner: 10BCornwall) [15:19:26] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:21:14] (03PS1) 10Brouberol: airflow: run the initdb job outside of hooks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063830 (https://phabricator.wikimedia.org/T372286) [15:22:33] (03CR) 10Brouberol: [C:03+2] airflow: run the initdb job outside of hooks [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063830 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol) [15:24:07] (03CR) 10Filippo Giunchedi: "Is there an host you tested this on? I'd like to take a look at the final result" [puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [15:24:26] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:24:53] 10ops-codfw, 06DC-Ops: asw-c7-codfw: PEM 0 is not powered - https://phabricator.wikimedia.org/T372782#10073874 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm found with on and green status light. reseated power cable and psu. alert confirmed cleared. 
[15:24:58] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:25:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-sd2002.mgmt.codfw.wmnet with reboot policy FORCED [15:25:31] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logging-sd2002'] [15:25:33] (03PS1) 10Papaul: Add new codfw ganeti node to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1063832 (https://phabricator.wikimedia.org/T365651) [15:25:42] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:26:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['logging-sd2002'] [15:26:24] (03CR) 10Papaul: [C:03+2] Add new codfw ganeti node to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1063832 (https://phabricator.wikimedia.org/T365651) (owner: 10Papaul) [15:27:28] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:29:14] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logging-sd2003'] [15:29:32] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['logging-sd2003'] [15:30:05] jan_drewniak: #bothumor I � Unicode. All rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240819T1530). 
[15:30:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host logging-sd2001.codfw.wmnet with OS bookworm [15:30:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host logging-sd2002.codfw.wmnet with OS bookworm [15:30:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host logging-sd2003.codfw.wmnet with OS bookworm [15:30:23] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10073903 (10Papaul) @Jhancock.wm Moritz is out, I added those nodes to site.pp for you [15:30:51] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd200[1-4] - https://phabricator.wikimedia.org/T370545#10073912 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host logging-sd2001.codfw.wmnet with OS bookworm [15:30:53] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:30:54] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd200[1-4] - https://phabricator.wikimedia.org/T370545#10073913 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host logging-sd2002.codfw.wmnet with OS bookworm [15:30:56] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd200[1-4] - https://phabricator.wikimedia.org/T370545#10073914 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host logging-sd2003.codfw.wmnet with OS bookworm [15:31:08] cleanupTitles finished with enwiki btw, now at enwiktionary [15:32:53] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp6016*} and A:cp for 9.2.5-1wm2 [15:33:15] (03PS2) 10Btullis: Remove the
webrequest_frontend_rc0 gobblin job [puppet] - 10https://gerrit.wikimedia.org/r/1063820 (https://phabricator.wikimedia.org/T372456) [15:34:55] (03PS1) 10Brouberol: airflow: run the initdb as an initcontainer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063836 (https://phabricator.wikimedia.org/T372286) [15:35:07] (03PS1) 10Ladsgroup: Reduce rate-limit for trusted editors of commons to 1500 every 3m [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063837 (https://phabricator.wikimedia.org/T370304) [15:35:40] (03CR) 10CI reject: [V:04-1] airflow: run the initdb as an initcontainer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063836 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol) [15:36:26] (03PS2) 10Brouberol: airflow: run the initdb as an initcontainer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063836 (https://phabricator.wikimedia.org/T372286) [15:36:39] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on P{cp6016*} and A:cp for 9.2.5-1wm2 [15:39:11] (03CR) 10AOkoth: "Yes: https://puppet-compiler.wmflabs.org/output/1063766/3681/" [puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [15:39:34] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2035.codfw.wmnet [reason: T372160] [15:39:37] T372160: ManagementSSHDown - https://phabricator.wikimedia.org/T372160 [15:40:10] (03CR) 10Clément Goubert: [C:03+1] Remove parsoid-php from MEDIAWIKI_SERVICES [cookbooks] - 10https://gerrit.wikimedia.org/r/1063261 (https://phabricator.wikimedia.org/T359387) (owner: 10Scott French) [15:40:20] 10ops-codfw, 06SRE, 06DC-Ops: cp2035 ManagementSSHDown - https://phabricator.wikimedia.org/T372160#10073995 (10ssingh) [15:41:41] godog: silly question but where can I find the phaultfinder code? 
[15:41:51] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10074000 (10Andrew) 05Resolved→03Open @cmooney says about cloudcephosd1036: > there is no sfp in port 21 on cloudsw1-d5-eqiad h... [15:41:57] I looked in the usual places and don't see it. feature request is to add the hostname to the task title so I might as well send a CR for it :) [15:42:24] (03PS1) 10Btullis: Fix the absenting of this gobblin test resource [puppet] - 10https://gerrit.wikimedia.org/r/1063838 (https://phabricator.wikimedia.org/T372456) [15:43:15] (03CR) 10BCornwall: Create corto deployment/configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: 10BCornwall) [15:43:56] (03CR) 10Clément Goubert: Remove parsoid-php from MEDIAWIKI_SERVICES [cookbooks] - 10https://gerrit.wikimedia.org/r/1063261 (https://phabricator.wikimedia.org/T359387) (owner: 10Scott French) [15:45:00] (03CR) 10Clément Goubert: [C:03+1] Remove parsoid-php from MEDIAWIKI_SERVICES [cookbooks] - 10https://gerrit.wikimedia.org/r/1063261 (https://phabricator.wikimedia.org/T359387) (owner: 10Scott French) [15:45:06] sukhe: you are right it isn't very discoverable atm, operations/debs/phalerts [15:45:19] sukhe: see also T351389 [15:45:19] T351389: Explore a simpler way to tweak opened task titles from alertmanager - https://phabricator.wikimedia.org/T351389 [15:45:23] (03CR) 10CDanis: [C:03+1] Reduce rate-limit for trusted editors of commons to 1500 every 3m [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063837 (https://phabricator.wikimedia.org/T370304) (owner: 10Ladsgroup) [15:45:57] cleanupTitles has now reached frwiki where there’s quite a bit to do [15:45:57] godog: ah as I suspected, more involved than I was foreseeing :) [15:45:58] thanks! 
[15:46:07] sure np, thanks for reaching out sukhe [15:46:25] (very helpful bot) [15:46:29] (03CR) 10Brouberol: [C:03+2] airflow: run the initdb as an initcontainer [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063836 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol) [15:46:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logging-sd2003.codfw.wmnet with reason: host reimage [15:46:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logging-sd2001.codfw.wmnet with reason: host reimage [15:49:05] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:49:32] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:49:32] (frwiki done, 49 rows were updated there) [15:50:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-sd2003.codfw.wmnet with reason: host reimage [15:50:25] (03CR) 10AOkoth: sql_exporter: specify column for metric (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [15:50:34] (03PS11) 10AOkoth: sql_exporter: specify column for metric [puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) [15:50:58] (03CR) 10CI reject: [V:04-1] sql_exporter: specify column for metric [puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [15:52:45] (03CR) 10Samtar: [C:04-2] "Hold until `1.43.0-wmf.20`" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1062977 (https://phabricator.wikimedia.org/T372527) (owner: 10Samtar) [15:53:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-sd2001.codfw.wmnet with reason: host reimage [15:56:16] (03PS1) 10Ahmon Dancy: Remove old image building command for 
mwbuilder sudo [puppet] - 10https://gerrit.wikimedia.org/r/1063842 (https://phabricator.wikimedia.org/T371904) [15:58:04] (03PS2) 10Ahmon Dancy: Remove old image building command for mwbuilder sudo [puppet] - 10https://gerrit.wikimedia.org/r/1063842 (https://phabricator.wikimedia.org/T371904) [15:59:26] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:59:45] (03PS12) 10AOkoth: sql_exporter: specify column for metric [puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) [16:02:50] (03CR) 10CDanis: [C:03+1] Add pbuilder hook for component/php81 [puppet] - 10https://gerrit.wikimedia.org/r/1062754 (https://phabricator.wikimedia.org/T372507) (owner: 10Scott French) [16:02:53] (03CR) 10CDanis: [C:03+1] Add component/php81 for bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1062753 (https://phabricator.wikimedia.org/T372507) (owner: 10Scott French) [16:02:58] (03CR) 10Krinkle: [C:03+1] Reduce rate-limit for trusted editors of commons to 1500 every 3m [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063837 (https://phabricator.wikimedia.org/T370304) (owner: 10Ladsgroup) [16:04:08] cleanupTitles now at itwiki [16:05:36] (03PS1) 10Stevemunene: trafficserver: add airflow-test-k8s discovery record [puppet] - 10https://gerrit.wikimedia.org/r/1063848 (https://phabricator.wikimedia.org/T368760) [16:05:57] huh, and itwiki already done [16:06:12] (03CR) 10CI reject: [V:04-1] trafficserver: add airflow-test-k8s discovery record [puppet] - 10https://gerrit.wikimedia.org/r/1063848 (https://phabricator.wikimedia.org/T368760) (owner: 10Stevemunene) [16:06:24] (03CR) 10AOkoth: sql_exporter: specify column for metric (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1063766 
(https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [16:06:39] (03PS1) 10Ryan Kemper: wdqs: teardown experimental endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1063849 (https://phabricator.wikimedia.org/T371833) [16:06:39] then I expect the script should be somewhere over halfway done already, a majority of the biggest wikis are earlier in the alphabet [16:06:51] (though some big ones are still coming, like ruwiki or zhwiki) [16:07:08] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1036.eqiad.wmnet with OS bullseye [16:08:10] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:08:35] (03PS2) 10Stevemunene: trafficserver: add airflow-test-k8s discovery record [puppet] - 10https://gerrit.wikimedia.org/r/1063848 (https://phabricator.wikimedia.org/T368760) [16:13:33] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:13:34] (03PS1) 10Thcipriani: DNM: Testing the patchdemo running on k8s! [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063850 (https://phabricator.wikimedia.org/T371822) [16:14:56] (03Abandoned) 10Thcipriani: DNM: Testing the patchdemo running on k8s! 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063850 (https://phabricator.wikimedia.org/T371822) (owner: 10Thcipriani) [16:15:32] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/1063766/3682/" [puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [16:17:24] (03CR) 10Btullis: [C:03+2] Fix the absenting of this gobblin test resource [puppet] - 10https://gerrit.wikimedia.org/r/1063838 (https://phabricator.wikimedia.org/T372456) (owner: 10Btullis) [16:18:38] jouncebot: nowandnext [16:18:39] No deployments scheduled for the next 0 hour(s) and 41 minute(s) [16:18:39] In 0 hour(s) and 41 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240819T1700) [16:18:39] In 0 hour(s) and 41 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240819T1700) [16:18:42] cool [16:18:49] (03CR) 10Ladsgroup: [C:03+2] Reduce rate-limit for trusted editors of commons to 1500 every 3m [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063837 (https://phabricator.wikimedia.org/T370304) (owner: 10Ladsgroup) [16:19:34] (03Merged) 10jenkins-bot: Reduce rate-limit for trusted editors of commons to 1500 every 3m [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063837 (https://phabricator.wikimedia.org/T370304) (owner: 10Ladsgroup) [16:19:49] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1063837|Reduce rate-limit for trusted editors of commons to 1500 every 3m (T370304)]] [16:20:09] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:20:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-sd2003.codfw.wmnet with OS bookworm [16:20:15] !log jhancock@cumin2002 END (PASS) - 
Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:20:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-sd2001.codfw.wmnet with OS bookworm [16:20:41] (03PS1) 10Dreamy Jazz: Enable temporary accounts on enwiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063852 (https://phabricator.wikimedia.org/T372784) [16:21:02] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd200[1-4] - https://phabricator.wikimedia.org/T370545#10074205 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host logging-sd2003.codfw.wmnet with OS bookworm completed: - logging-sd... [16:21:05] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd200[1-4] - https://phabricator.wikimedia.org/T370545#10074206 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host logging-sd2001.codfw.wmnet with OS bookworm completed: - logging-sd... [16:21:52] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1063837|Reduce rate-limit for trusted editors of commons to 1500 every 3m (T370304)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:21:56] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [16:22:16] Amir1: Could you ping me when you are done. Want to make a config change to beta wikis. [16:22:27] I can just merge it for you [16:22:35] https://gerrit.wikimedia.org/r/1063852 ? [16:22:43] AFAIK you had to run `scap backport` based on https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Overview#MediaWiki_config [16:22:54] nah [16:22:56] i.e. "will still need to be merged on production to prevent alerts from being triggered" [16:23:01] Sure. I can +2 it. 
[16:23:11] It needs a rebase only [16:23:12] (03CR) 10Dreamy Jazz: [C:03+2] Enable temporary accounts on enwiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063852 (https://phabricator.wikimedia.org/T372784) (owner: 10Dreamy Jazz) [16:23:17] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2022.codfw.wmnet with OS bullseye [16:23:20] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2023.codfw.wmnet with OS bullseye [16:23:24] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2024.codfw.wmnet with OS bullseye [16:23:31] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2025.codfw.wmnet with OS bullseye [16:23:44] (03PS13) 10AOkoth: sql_exporter: specify column for metric [puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) [16:23:54] (03Merged) 10jenkins-bot: Enable temporary accounts on enwiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063852 (https://phabricator.wikimedia.org/T372784) (owner: 10Dreamy Jazz) [16:25:33] (03CR) 10Ryan Kemper: [C:03+2] wdqs: teardown experimental endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1063849 (https://phabricator.wikimedia.org/T371833) (owner: 10Ryan Kemper) [16:25:47] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1036.eqiad.wmnet with reason: host reimage [16:26:22] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1063837|Reduce rate-limit for trusted editors of commons to 1500 every 3m (T370304)]] (duration: 06m 33s) [16:27:30] Dreamy_Jazz: rebased in production, you should be good to go (the automatic every ten minute scap deploy) [16:27:40] Thanks. [16:27:41] (to beta) [16:28:57] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1036.eqiad.wmnet with reason: host reimage [16:33:08] (03CR) 10Scott French: [C:03+2] "Thanks, Ahmon!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1060182 (owner: 10Ahmon Dancy) [16:36:37] (03CR) 10Scott French: [C:03+2] Remove parsoid-php from MEDIAWIKI_SERVICES [cookbooks] - 10https://gerrit.wikimedia.org/r/1063261 (https://phabricator.wikimedia.org/T359387) (owner: 10Scott French) [16:36:47] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2035.codfw.wmnet [reason: [done] T372160] [16:36:50] T372160: cp2035 ManagementSSHDown - https://phabricator.wikimedia.org/T372160 [16:36:59] 10ops-codfw, 06SRE, 06DC-Ops: cp2035 ManagementSSHDown - https://phabricator.wikimedia.org/T372160#10074382 (10Jhancock.wm) 05Open→03Resolved hard cycled the server. powered up with no issues. no errors on idrac. version is 4.something. updated it to 7.0. ready to be added back. [16:37:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logging-sd2002.codfw.wmnet with reason: host reimage [16:38:36] (03CR) 10Scott French: [V:03+2 C:03+2] php7.4-fpm-multiversion-base: Fix a couple of typos [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1060182 (owner: 10Ahmon Dancy) [16:38:51] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T370754, transfer fresh wdqs-scholarly journal) xfer scholarly_articles from wdqs1023.eqiad.wmnet -> wdqs1024.eqiad.wmnet, repooling neither afterwards [16:38:54] T370754: Import WDQS subgraphs to production nodes - https://phabricator.wikimedia.org/T370754 [16:40:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-sd2002.codfw.wmnet with reason: host reimage [16:40:59] (03PS9) 10Ryan Kemper: wdqs: store metadata about graph split type [cookbooks] - 10https://gerrit.wikimedia.org/r/1053205 (https://phabricator.wikimedia.org/T364077) [16:41:34] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T370754, transfer fresh wdqs-scholarly journal) xfer scholarly_articles from 
wdqs1023.eqiad.wmnet -> wdqs1024.eqiad.wmnet, repooling neither afterwards [16:42:15] !log ryankemper@cumin2002 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) (T370754, transfer fresh wdqs-scholarly journal) xfer scholarly_articles from wdqs1023.eqiad.wmnet -> wdqs1024.eqiad.wmnet, repooling neither afterwards [16:42:41] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T370754, transfer fresh wdqs-scholarly journal) xfer scholarly_articles from wdqs1023.eqiad.wmnet -> wdqs1024.eqiad.wmnet w/ force delete existing files, repooling neither afterwards [16:43:12] (03CR) 10Scott French: "Thanks, Chris!" [puppet] - 10https://gerrit.wikimedia.org/r/1062753 (https://phabricator.wikimedia.org/T372507) (owner: 10Scott French) [16:43:56] (03CR) 10Scott French: [C:03+2] Add component/php81 for bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1062753 (https://phabricator.wikimedia.org/T372507) (owner: 10Scott French) [16:47:05] 10ops-codfw, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800 (10RobH) 03NEW [16:47:35] 10ops-codfw, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10074438 (10RobH) [16:47:42] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1036.eqiad.wmnet with OS bullseye [16:49:33] 10ops-codfw, 06DC-Ops, 06serviceops: Q1:rack/setup/install mc-misc200[12] - https://phabricator.wikimedia.org/T372800#10074457 (10RobH) a:03jijiki @jijiki, Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the new ser... 
[16:52:20] (03Merged) 10jenkins-bot: Remove parsoid-php from MEDIAWIKI_SERVICES [cookbooks] - 10https://gerrit.wikimedia.org/r/1063261 (https://phabricator.wikimedia.org/T359387) (owner: 10Scott French) [16:57:26] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:57:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1300.eqiad.wmnet with OS bullseye [16:57:33] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:57:34] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1301.eqiad.wmnet with OS bullseye [16:57:39] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:57:40] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10074502 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1300.eqiad.wmnet with OS bullseye... [16:57:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1302.eqiad.wmnet with OS bullseye [16:57:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10074503 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1301.eqiad.wmnet with OS bullseye... 
[16:57:45] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10074504 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1302.eqiad.wmnet with OS bullseye... [16:57:48] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:57:50] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:57:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1303.eqiad.wmnet with OS bullseye [16:57:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10074505 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1303.eqiad.wmnet with OS bullseye... [16:57:57] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:57:58] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1304.eqiad.wmnet with OS bullseye [16:58:04] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10074507 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1304.eqiad.wmnet with OS bullseye... 
[16:59:03] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10074509 (10Jclark-ctr) [16:59:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10074512 (10Jclark-ctr) [17:00:34] (03CR) 10Scott French: [C:03+2] mediawiki: consistently apply stats-global values via symlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063031 (https://phabricator.wikimedia.org/T365265) (owner: 10Scott French) [17:00:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10074514 (10Jclark-ctr) [17:02:08] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1286.eqiad.wmnet with OS bullseye [17:02:09] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1285.eqiad.wmnet with OS bullseye [17:02:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10074521 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1286.eqiad.wmnet with OS bull... [17:02:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10074522 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1285.eqiad.wmnet with OS bull... 
[17:02:47] (03Merged) 10jenkins-bot: mediawiki: consistently apply stats-global values via symlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063031 (https://phabricator.wikimedia.org/T365265) (owner: 10Scott French) [17:02:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:02:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-sd2002.codfw.wmnet with OS bookworm [17:03:06] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd200[1-4] - https://phabricator.wikimedia.org/T370545#10074524 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host logging-sd2002.codfw.wmnet with OS bookworm completed: - logging-sd... [17:03:45] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd200[1-4] - https://phabricator.wikimedia.org/T370545#10074527 (10Jhancock.wm) [17:08:18] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-debug: apply [17:08:42] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [17:09:04] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [17:09:18] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [17:11:35] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2024.codfw.wmnet with OS bullseye [17:13:55] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [17:14:14] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [17:14:16] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [17:14:23] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Q4:rack/setup/install ganeti2035 
to ganeti2044 - https://phabricator.wikimedia.org/T365651#10074575 (10Jhancock.wm) [17:14:30] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [17:14:32] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [17:14:53] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [17:14:54] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-misc: apply [17:15:09] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [17:15:10] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [17:15:28] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [17:15:29] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:15:53] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:15:55] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [17:16:13] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply [17:18:19] * swfrench-wmf wishes the helmfile logging hooks made clear which release(s) are involved [17:18:51] FYI, these are only touching the statsd exporter (not mediawiki) ^^ [17:19:53] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1286.eqiad.wmnet with reason: host reimage [17:20:00] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1285.eqiad.wmnet with reason: host reimage [17:22:59] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#10074610 (10Dzahn) Thanks @Papaul! note from today's collab team meeting: We defined codfw as the home for gerrit and eqiad as the home for phab/phorge. So that means... 
[17:23:32] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1286.eqiad.wmnet with reason: host reimage [17:25:36] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [17:25:56] 06SRE, 06collaboration-services: setup gerrit2003 with gerrit service - https://phabricator.wikimedia.org/T372804 (10Dzahn) 03NEW [17:25:56] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [17:25:57] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:26:11] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:26:12] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [17:26:28] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [17:26:29] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [17:26:43] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [17:26:44] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [17:27:04] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [17:27:05] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:27:16] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1285.eqiad.wmnet with reason: host reimage [17:27:22] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:27:23] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply [17:27:35] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply [17:29:00] !log statsd-exporter resource bumps (https://gerrit.wikimedia.org/r/1061856) are now everywhere - T371885 [17:29:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:03] T371885: Gaps in Grafana graphs using Thanos - https://phabricator.wikimedia.org/T371885 [17:29:51] (03CR) 10Scott French: [C:03+2] mw-debug: pilot bookworm statsd exporter image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063032 (https://phabricator.wikimedia.org/T368366) (owner: 10Scott French) [17:31:12] (03Merged) 10jenkins-bot: mw-debug: pilot bookworm statsd exporter image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063032 (https://phabricator.wikimedia.org/T368366) (owner: 10Scott French) [17:33:36] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T370754, transfer fresh wdqs-scholarly journal) xfer scholarly_articles from wdqs1023.eqiad.wmnet -> wdqs1024.eqiad.wmnet w/ force delete existing files, repooling neither afterwards [17:33:39] T370754: Import WDQS subgraphs to production nodes - https://phabricator.wikimedia.org/T370754 [17:36:09] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-debug: apply [17:36:35] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [17:38:36] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [17:38:51] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [17:40:39] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:41:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10074692 (10VRiley-WMF) plugged the port in and also reseated management cable [17:42:57] !log mforns@deploy1003 Started deploy [analytics/refinery@9eaecec]: Regular analytics weekly train [analytics/refinery@9eaecec7] [17:44:14] !log ryankemper@cumin2002 END (FAIL) - Cookbook 
sre.hosts.reimage (exit_code=99) for host wdqs2022.codfw.wmnet with OS bullseye [17:44:26] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2025.codfw.wmnet with OS bullseye [17:44:45] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2023.codfw.wmnet with OS bullseye [17:45:41] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:48:36] (03PS2) 10Scott French: mw-api-int: pilot bookworm statsd exporter image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063033 (https://phabricator.wikimedia.org/T368366) [17:48:36] (03PS2) 10Scott French: mediawiki: upgrade all statsd exporters to bookworm image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063034 (https://phabricator.wikimedia.org/T368366) [17:50:13] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:50:14] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1285.eqiad.wmnet with OS bullseye [17:50:19] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:50:20] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1286.eqiad.wmnet with OS bullseye [17:50:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10074735 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1285.eqiad.wmnet with OS bullseye... 
[17:50:25] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10074736 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1286.eqiad.wmnet with OS bullseye... [17:50:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10074738 (10Jclark-ctr) [17:52:23] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1299.eqiad.wmnet with OS bullseye [17:52:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10074750 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1299.eqiad.wmnet with OS bull... [17:53:26] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1298.mgmt.eqiad.wmnet with reboot policy FORCED [17:53:32] (03CR) 10Scott French: "Thanks again for reviewing the analogous change to mw-debug. That's now live, and it appears to be producing metrics as expected." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1063033 (https://phabricator.wikimedia.org/T368366) (owner: 10Scott French) [17:53:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:55:28] !log mforns@deploy1003 Finished deploy [analytics/refinery@9eaecec]: Regular analytics weekly train [analytics/refinery@9eaecec7] (duration: 12m 30s) [17:55:49] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1298.mgmt.eqiad.wmnet with reboot policy FORCED [18:03:03] (03CR) 10Dzahn: [V:03+1 C:03+2] durum: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1059152 (owner: 10Dzahn) [18:05:25] !log mforns@deploy1003 Started deploy [analytics/refinery@9eaecec] (thin): Regular analytics weekly train THIN [analytics/refinery@9eaecec7] [18:08:14] cleanupTitles finally made its way through wikidatawiki (took a while) and has arrived at zhwiki [18:08:55] (03CR) 10Scott French: [C:03+2] Add pbuilder hook for component/php81 [puppet] - 10https://gerrit.wikimedia.org/r/1062754 (https://phabricator.wikimedia.org/T372507) (owner: 10Scott French) [18:08:57] that's pretty quick [18:09:42] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1299.eqiad.wmnet with reason: host reimage [18:09:50] !log mforns@deploy1003 Finished deploy [analytics/refinery@9eaecec] (thin): Regular analytics weekly train THIN [analytics/refinery@9eaecec7] (duration: 04m 25s) [18:10:16] !log mforns@deploy1003 Started deploy [analytics/refinery@9eaecec] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@9eaecec7] [18:12:13] !log FINISHED lucaswerkmeister-wmde@mwmaint1002:~$ foreachwiki maintenance/cleanupTitles.php --prefix=T195546 
--reporting-interval=1000000000 2>&1 | tee ~/T195546.log [18:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:16] T195546: Run the maintenance script cleanupTitles.php on all wikis to rescue currently-inaccessible pages - https://phabricator.wikimedia.org/T195546 [18:12:53] (03CR) 10Brennen Bearnes: [C:03+1] gitlab: add option to serve a robots.txt and enable it in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/1063758 (https://phabricator.wikimedia.org/T372538) (owner: 10Jelto) [18:12:55] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1299.eqiad.wmnet with reason: host reimage [18:13:40] !log mforns@deploy1003 Finished deploy [analytics/refinery@9eaecec] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@9eaecec7] (duration: 03m 24s) [18:14:09] Remember to post the cleanupTitles log file to the Phabricator task [18:15:09] I just did :) [18:15:25] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[39-41] into service - https://phabricator.wikimedia.org/T372814 (10Andrew) 03NEW [18:19:31] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [18:20:07] (03PS1) 10Andrew Bogott: Put cloudcephosd1036 into service [puppet] - 10https://gerrit.wikimedia.org/r/1063861 (https://phabricator.wikimedia.org/T363344) [18:22:19] (03CR) 10Bking: [C:03+1] cloudnative-pg-cluster: enable ingress from the join pod to PG/5432 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063808 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol) [18:24:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q1:rack/setup/install ml-serve1009-1011 (3x), ml-lab1001-1002 (2x), dse-k8s-worker1009 (1x) - https://phabricator.wikimedia.org/T372432#10074927 (10VRiley-WMF) [18:29:58] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate 
netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:31:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudlb2004-dev.codfw.wmnet with OS bookworm [18:31:28] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10074987 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudlb2004-dev.codfw.wmnet with OS bookworm [18:32:47] (03PS1) 10Ssingh: P:durum: return the host for the check service [puppet] - 10https://gerrit.wikimedia.org/r/1063864 [18:33:26] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3683/co" [puppet] - 10https://gerrit.wikimedia.org/r/1063864 (owner: 10Ssingh) [18:35:10] (03PS2) 10Ssingh: P:durum: return the host for the check service [puppet] - 10https://gerrit.wikimedia.org/r/1063864 [18:35:42] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[39-41] into service - https://phabricator.wikimedia.org/T372814#10074990 (10Andrew) All three of these need reimaging to get the drive labels set up properly; right now they all have a big OSD drive assigned to the os. 
[18:35:46] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3684/co" [puppet] - 10https://gerrit.wikimedia.org/r/1063864 (owner: 10Ssingh) [18:37:48] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1039.eqiad.wmnet with OS bullseye [18:38:03] (03CR) 10Ssingh: [V:03+1 C:03+2] P:durum: return the host for the check service [puppet] - 10https://gerrit.wikimedia.org/r/1063864 (owner: 10Ssingh) [18:53:11] 10ops-eqiad, 06collaboration-services, 06DC-Ops: reimage gerrit1004 as phab1003 - https://phabricator.wikimedia.org/T372817 (10Dzahn) 03NEW [18:53:19] 10ops-eqiad, 06collaboration-services, 06DC-Ops: reimage gerrit1004 as phab1003 - https://phabricator.wikimedia.org/T372817#10075104 (10Dzahn) [18:53:32] 10ops-eqiad, 06collaboration-services, 06DC-Ops: reimage gerrit1004 as phab1003 - https://phabricator.wikimedia.org/T372817#10075109 (10Dzahn) [18:53:53] 10ops-eqiad, 06collaboration-services, 06DC-Ops: reimage gerrit1004 as phab1003 - https://phabricator.wikimedia.org/T372817#10075110 (10Dzahn) [18:55:33] 10ops-eqiad, 06collaboration-services, 06DC-Ops: reimage gerrit1004 as phab1003 - https://phabricator.wikimedia.org/T372817#10075111 (10Dzahn) [18:56:56] 10ops-eqiad, 06collaboration-services, 06DC-Ops: reimage gerrit1004 as phab1005 - https://phabricator.wikimedia.org/T372817#10075113 (10Dzahn) [18:57:12] 10ops-eqiad, 06collaboration-services, 06DC-Ops: reimage gerrit1004.wikimedia.org as phab1005.eqiad.wmnet - https://phabricator.wikimedia.org/T372817#10075114 (10Dzahn) [18:59:04] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#10075116 (10Dzahn) T372817 [19:02:00] (03PS4) 10أنون: [arwikinews]: Upgrade license to CC BY-SA 4.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063868 (https://phabricator.wikimedia.org/T372730) [19:04:28] 
(03CR) 10JHathaway: [C:03+1] "looks good, proposed one alternative approach" [puppet] - 10https://gerrit.wikimedia.org/r/1063728 (https://phabricator.wikimedia.org/T372728) (owner: 10Ayounsi) [19:04:29] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1039.eqiad.wmnet with OS bullseye [19:05:51] (03PS1) 10Dzahn: site: rename gerrit1004 to phab1005 [puppet] - 10https://gerrit.wikimedia.org/r/1063870 (https://phabricator.wikimedia.org/T372817) [19:06:09] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1039.eqiad.wmnet with OS bullseye [19:07:28] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2024.codfw.wmnet with OS bullseye [19:09:45] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [19:09:45] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cloudsw1-f4-eqiad.mgmt.eqiad.wmnet - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [19:10:06] (03PS5) 10أنون: [arwikinews]: Upgrade license to CC BY-SA 4.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063868 (https://phabricator.wikimedia.org/T372730) [19:10:53] (03CR) 10NMW03: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063868 (https://phabricator.wikimedia.org/T372730) (owner: 10أنون) [19:11:36] (03PS6) 10أنون: [arwikinews]: Upgrade license to CC BY-SA 4.0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063868 (https://phabricator.wikimedia.org/T372730) [19:11:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - 
https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:12:21] andrewbogott: just starting to look, are those cloudsw pages related to ceph reimaging? [19:12:22] (03PS1) 10MusikAnimal: labs: enable line numbering everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063874 (https://phabricator.wikimedia.org/T357482) [19:12:28] (03CR) 10Dzahn: [C:03+1] P:durum: return the host for the check service [puppet] - 10https://gerrit.wikimedia.org/r/1063864 (owner: 10Ssingh) [19:12:35] (03CR) 10NMW03: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063868 (https://phabricator.wikimedia.org/T372730) (owner: 10أنون) [19:12:52] rzl: yes, sorry. They'll clear shortly but may fire again when the second batch pools. [19:13:39] I did my best to ack [19:15:43] (03PS1) 10Ssingh: P:durum: show hostname only in the API, not on the web interface [puppet] - 10https://gerrit.wikimedia.org/r/1063875 [19:16:55] andrewbogott: okay, just to confirm -- you don't have a way to silence the alerts preemptively, but they have no user impact and you don't want us to do anything when they fire? [19:17:27] that's correct. The alerts should be bound to the wmcs team but because of the way the alert is set up it's difficult/impossible to specify [19:17:36] got it [19:18:10] very sorry for the page. 99 times/100 ceph rebalances without overloading the network, I don't think we understand what causes it to go overboard but I know of at least one thing to mitigate it. 
[19:18:35] it happens :) thanks for the quick response [19:19:45] RESOLVED: Primary inbound port utilisation over 80% #page: Device cloudsw1-f4-eqiad.mgmt.eqiad.wmnet recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [19:20:03] (03CR) 10Ssingh: [C:03+2] P:durum: show hostname only in the API, not on the web interface [puppet] - 10https://gerrit.wikimedia.org/r/1063875 (owner: 10Ssingh) [19:26:56] 10ops-eqiad, 06collaboration-services, 06DC-Ops, 13Patch-For-Review: reimage gerrit1004.wikimedia.org as phab1005.eqiad.wmnet - https://phabricator.wikimedia.org/T372817#10075225 (10RobH) [19:28:15] !log sbassett@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [19:28:34] !log sbassett@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [19:29:08] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1039.eqiad.wmnet with OS bullseye [19:29:33] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1039.eqiad.wmnet with OS bullseye [19:32:33] FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:41:06] (03PS1) 10Ahmon Dancy: Bump buildkitd to 0.15.2 [puppet] - 10https://gerrit.wikimedia.org/r/1063882 (https://phabricator.wikimedia.org/T372820) [19:45:32] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1039.eqiad.wmnet with OS bullseye [19:45:55] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1039.eqiad.wmnet with OS bullseye [19:51:39] !log jhancock@cumin2002 END (FAIL) - Cookbook 
sre.hosts.reimage (exit_code=99) for host cloudlb2004-dev.codfw.wmnet with OS bookworm [19:52:04] !log dancy@deploy1003 Installing scap version "4.98.0" for 207 hosts [19:52:44] !log dancy@deploy1003 Installation of scap version "4.98.0" completed for 207 hosts [19:52:48] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10075330 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudlb2004-dev.codfw.wmnet with OS bookworm executed... [19:53:08] !log dancy@deploy1003 Started scap sync-world: testing T371904 [19:53:11] T371904: Rewrite remaining make-container-image code in Python - https://phabricator.wikimedia.org/T371904 [19:54:54] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2024.codfw.wmnet with OS bullseye [19:55:32] (03PS2) 10Andrew Bogott: Put cloudcephosd1036 into service [puppet] - 10https://gerrit.wikimedia.org/r/1063861 (https://phabricator.wikimedia.org/T363344) [19:55:32] (03PS1) 10Andrew Bogott: Add preseed value for cloudcephosd10039-41 [puppet] - 10https://gerrit.wikimedia.org/r/1063891 [19:55:32] (03PS1) 10Andrew Bogott: Make cloudcephosd1039-1041 into ceph osd nodes [puppet] - 10https://gerrit.wikimedia.org/r/1063892 (https://phabricator.wikimedia.org/T372814) [19:58:06] (03CR) 10CDanis: [C:03+1] confd: fix error state file name in check [puppet] - 10https://gerrit.wikimedia.org/r/1063223 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French) [19:58:54] (03CR) 10Andrew Bogott: [C:03+2] Add preseed value for cloudcephosd10039-41 [puppet] - 10https://gerrit.wikimedia.org/r/1063891 (owner: 10Andrew Bogott) [19:59:26] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:59:46] (03PS1) 10Dzahn: site: (WIP) try applying gerrit role on gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1063893 (https://phabricator.wikimedia.org/T372804) [19:59:49] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1039.eqiad.wmnet with OS bullseye [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240819T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. [20:00:12] (03CR) 10Dzahn: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1063758/3685/gitlab2002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1063758 (https://phabricator.wikimedia.org/T372538) (owner: 10Jelto) [20:00:14] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1039.eqiad.wmnet with OS bullseye [20:04:22] (03CR) 10Pppery: "Gerrit reviewer bot down?" 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1063306 (owner: 10Pppery) [20:04:25] (03CR) 10Dzahn: [V:03+1 C:03+1] "unfortunately can't compile it on the test host but that seems to be a general issue with the compilers (https://phabricator.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1063758 (https://phabricator.wikimedia.org/T372538) (owner: 10Jelto) [20:06:36] (03CR) 10Dzahn: [V:03+1 C:03+1] "well, we are using our own local puppetmaster here, so it seems the issue with local secrets is back" [puppet] - 10https://gerrit.wikimedia.org/r/1063758 (https://phabricator.wikimedia.org/T372538) (owner: 10Jelto) [20:06:55] (03CR) 10Dzahn: [V:03+1 C:03+2] gitlab: add option to serve a robots.txt and enable it in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/1063758 (https://phabricator.wikimedia.org/T372538) (owner: 10Jelto) [20:07:28] (03CR) 10Dzahn: [V:03+1 C:03+2] "I'll merge it regardless since I already know the compiler result doesn't reflect reality (claims it fails before this change when it does" [puppet] - 10https://gerrit.wikimedia.org/r/1063758 (https://phabricator.wikimedia.org/T372538) (owner: 10Jelto) [20:10:20] (03CR) 10Dzahn: [V:03+1 C:03+2] "noop in prod. on test instance /srv/robots.txt was created, added to nginx config, puppet restarted service" [puppet] - 10https://gerrit.wikimedia.org/r/1063758 (https://phabricator.wikimedia.org/T372538) (owner: 10Jelto) [20:16:13] (03CR) 10Dzahn: "just one thing - the logs can become pretty large. on another host we had to turn that off again to avoid running out of disk." 
[puppet] - 10https://gerrit.wikimedia.org/r/1059156 (owner: 10Dzahn) [20:16:28] (03PS2) 10Dzahn: durum: include throttling class, enable it on durum2001, accept/log only [puppet] - 10https://gerrit.wikimedia.org/r/1059156 [20:24:55] (03PS1) 10Dzahn: gerrit: temp set a gerrit IP for testing [puppet] - 10https://gerrit.wikimedia.org/r/1063896 (https://phabricator.wikimedia.org/T372804) [20:25:22] (03PS2) 10Dzahn: gerrit: temp set a gerrit IP for testing, gerrit2003 only [puppet] - 10https://gerrit.wikimedia.org/r/1063896 (https://phabricator.wikimedia.org/T372804) [20:25:35] (03CR) 10Scott French: [C:03+2] "Thanks, Ahmon!" [puppet] - 10https://gerrit.wikimedia.org/r/1063842 (https://phabricator.wikimedia.org/T371904) (owner: 10Ahmon Dancy) [20:25:45] (03CR) 10Dzahn: [C:03+2] gerrit: temp set a gerrit IP for testing, gerrit2003 only [puppet] - 10https://gerrit.wikimedia.org/r/1063896 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [20:26:00] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1039.eqiad.wmnet with OS bullseye [20:26:14] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1039.eqiad.wmnet with OS bullseye [20:28:32] (03CR) 10Dzahn: "surprisingly it already seems to work after just one change: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1063896" [puppet] - 10https://gerrit.wikimedia.org/r/1063893 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [20:30:09] (03CR) 10Scott French: [C:03+2] "Thanks, Chris!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1063223 (https://phabricator.wikimedia.org/T363924) (owner: 10Scott French) [20:33:17] (03PS1) 10Dzahn: gerrit: on gerrit2003, set firewall provider, admin groups, team owner [puppet] - 10https://gerrit.wikimedia.org/r/1063898 (https://phabricator.wikimedia.org/T372804) [20:33:56] (03PS2) 10Dzahn: gerrit: on gerrit2003, set firewall provider, admin groups, team owner [puppet] - 10https://gerrit.wikimedia.org/r/1063898 (https://phabricator.wikimedia.org/T372804) [20:35:33] (03CR) 10Dzahn: [C:03+2] Bump buildkitd to 0.15.2 [puppet] - 10https://gerrit.wikimedia.org/r/1063882 (https://phabricator.wikimedia.org/T372820) (owner: 10Ahmon Dancy) [20:36:43] (03CR) 10Dzahn: [C:03+2] gerrit: on gerrit2003, set firewall provider, admin groups, team owner [puppet] - 10https://gerrit.wikimedia.org/r/1063898 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [20:42:23] !log sbassett@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [20:44:14] !log mforns@deploy1003 Started deploy [airflow-dags/analytics_test@3ec5119]: (no justification provided) [20:44:22] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1039.eqiad.wmnet with reason: host reimage [20:44:26] !log mforns@deploy1003 Finished deploy [airflow-dags/analytics_test@3ec5119]: (no justification provided) (duration: 00m 11s) [20:45:08] !log eevans@deploy1003 Started deploy [restbase/deploy@b504108] (beta): Dry run beta deployment test [20:45:09] !log sbassett@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [20:45:41] !log eevans@deploy1003 Finished deploy [restbase/deploy@b504108] (beta): Dry run beta deployment test (duration: 00m 32s) [20:46:31] !log sbassett@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [20:48:17] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1039.eqiad.wmnet with reason: host reimage [20:49:19] 
!log sbassett@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [20:49:30] !log sbassett@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [20:49:32] !log sbassett@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [20:49:39] !log sbassett@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [20:49:41] !log sbassett@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [20:52:00] !log Deployed changes from T372570 to security.wikimedia.org (miscweb) [20:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:03] T372570: Username links are broken on security.wikimedia.org - https://phabricator.wikimedia.org/T372570 [20:52:35] (03CR) 10EoghanGaffney: sql_exporter: specify column for metric (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [20:57:16] !log eevans@deploy1003 Started deploy [restbase/deploy@b504108] (beta): Dry run beta deployment test [20:57:22] !log eevans@deploy1003 Finished deploy [restbase/deploy@b504108] (beta): Dry run beta deployment test (duration: 00m 06s) [21:00:04] Reedy, sbassett, Maryum, and manfredi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240819T2100). 
[21:01:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 20.87% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:01:21] (03CR) 10Dzahn: vrts: run install script on new server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1063733 (owner: 10AOkoth) [21:06:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 23.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [21:06:56] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1039.eqiad.wmnet with OS bullseye [21:07:28] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1040.eqiad.wmnet with OS bullseye [21:21:15] (03PS1) 10Dzahn: gerrit: create a temp insetup role to test java install in bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1063904 (https://phabricator.wikimedia.org/T372804) [21:21:33] (03CR) 10CI reject: [V:04-1] gerrit: create a temp insetup role to test java install in bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1063904 (https://phabricator.wikimedia.org/T372804) (owner: 10Dzahn) [21:26:32] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1040.eqiad.wmnet with reason: host reimage [21:30:00] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1040.eqiad.wmnet with reason: host reimage [21:32:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 23.85% idle - 
https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:37:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 23.85% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[21:48:44] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1040.eqiad.wmnet with OS bullseye
[21:50:07] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1041.eqiad.wmnet with OS bullseye
[21:50:50] deploying a beta-only patch
[21:51:46] (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1063874 (https://phabricator.wikimedia.org/T357482) (owner: MusikAnimal)
[21:52:33] (Merged) jenkins-bot: labs: enable line numbering everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/1063874 (https://phabricator.wikimedia.org/T357482) (owner: MusikAnimal)
[21:53:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:53:44] (done)
[22:00:40] (CR) Stoyofuku-wmf: "Abandoned in favor of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1060152" [mediawiki-config] - https://gerrit.wikimedia.org/r/1058683 (https://phabricator.wikimedia.org/T371070) (owner: Jdlrobson)
[22:09:10] !log andrew@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1041.eqiad.wmnet with reason: host reimage
[22:12:06] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1041.eqiad.wmnet with reason: host reimage
[22:30:39] !log andrew@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1041.eqiad.wmnet with OS bullseye
[22:58:01] ops-eqiad, SRE, cloud-services-team, DC-Ops, Patch-For-Review: Put cloudcephosd10[39-41] into service - https://phabricator.wikimedia.org/T372814#10075894 (Andrew) These are now rebuilt with proper partitioning. They probably shouldn't be bootstrapped until T372821 is resolved.
[23:03:46] (PS1) Krinkle: CommonSettings: Rename unregistered wgStatsHost to local "statsHost" var [mediawiki-config] - https://gerrit.wikimedia.org/r/1063913 (https://phabricator.wikimedia.org/T365265)
[23:04:00] (CR) Krinkle: Use the statsd-exporter service where available (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1041656 (https://phabricator.wikimedia.org/T365265) (owner: Giuseppe Lavagetto)
[23:05:07] (PS2) Krinkle: CommonSettings: Rename unregistered wgStatsHost to local "statsHost" var [mediawiki-config] - https://gerrit.wikimedia.org/r/1063913 (https://phabricator.wikimedia.org/T365265)
[23:06:10] (PS1) Chlod Alejandro: kawikisource: add custom logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1063914 (https://phabricator.wikimedia.org/T368868)
[23:07:19] (PS1) Chlod Alejandro: kaawiktionary: add custom logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1063915 (https://phabricator.wikimedia.org/T368868)
[23:08:49] (PS1) Chlod Alejandro: iglwiki: add custom logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1063916 (https://phabricator.wikimedia.org/T368868)
[23:09:56] (PS1) Chlod Alejandro: mywikisource: add custom logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1063918 (https://phabricator.wikimedia.org/T368868)
[23:11:29] (PS1) Chlod Alejandro: kuswiki: add custom logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1063919 (https://phabricator.wikimedia.org/T368868)
[23:11:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[23:12:18] (PS1) Chlod Alejandro: bewwiki: add custom logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1063920 (https://phabricator.wikimedia.org/T368868)
[23:13:29] only a few custom logos then chlod
[23:13:32] :p
[23:14:16] quite a few
[23:14:50] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1063914 (https://phabricator.wikimedia.org/T368868) (owner: Chlod Alejandro)
[23:18:38] (n.b.: i added the rest manually since it'd take a while to add everything in through schedule-deployment)
[23:29:28] (CR) Cwhite: [C:+1] alert: Resolve alerts DNS queries to alert1002 [dns] - https://gerrit.wikimedia.org/r/1063078 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[23:32:49] FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:38:39] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1063922
[23:38:39] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1063922 (owner: TrainBranchBot)
[23:39:39] (CR) Cwhite: alert: Ensure the alert[12]001 hosts use the spare::system role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1063231 (https://phabricator.wikimedia.org/T372607) (owner: Andrea Denisse)
[23:40:24] (CR) Cwhite: [C:+1] "+1 for deploy at the appropriate time" [puppet] - https://gerrit.wikimedia.org/r/1063233 (https://phabricator.wikimedia.org/T372607) (owner: Andrea Denisse)
[23:41:44] (CR) Cwhite: [C:+1] "+1 for deploy at the appropriate time" [puppet] - https://gerrit.wikimedia.org/r/1063234 (https://phabricator.wikimedia.org/T372607) (owner: Andrea Denisse)
[23:42:08] (CR) Cwhite: [C:+1] alert: Update alertmanager tests hostnames [puppet] - https://gerrit.wikimedia.org/r/1063235 (https://phabricator.wikimedia.org/T372418) (owner: Andrea Denisse)
[23:47:18] (CR) Dzahn: [V:+1 C:+1] "https://puppet-compiler.wmflabs.org/output/1057952/3691/prometheus1005.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1057952 (owner: Dzahn)
[23:48:16] (CR) Dzahn: [V:+1 C:+2] prometheus::ops: switch ferm::service to firewall::service [puppet] - https://gerrit.wikimedia.org/r/1057952 (owner: Dzahn)
[23:52:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:55:42] (CR) Cwhite: opensearch: unreach port and shards alerts (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1062708 (https://phabricator.wikimedia.org/T371083) (owner: Tiziano Fogli)
[23:55:59] !log prometheus - switched ferm::service to firewall::service (gerrit:1057952) - NOOP except /etc/ferm/conf.d/10_prometheus-web becomes /etc/ferm/conf.d/10_prometheus_web with identical rules
[23:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:59:24] !log prometheus - puppet on prometheus hosts very slow - reason appears to be that /srv/prometheus is recursively managed by puppet but has ~ 20x more files than the default soft limit of 1000
[23:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:59:26] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed