[00:04:07] !log prometheus3003/prometheus1006 - are trying to use puppetserver1002 but get connection refused from puppetserver1001.eqiad.wmnet port 8140 - causing other puppet errors [00:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:04:58] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1063922 (owner: 10TrainBranchBot) [00:07:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [00:21:38] !log previous message about prometheus can be ignored - race condition that solved itself on next puppet run [00:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:22] (03CR) 10Dzahn: [V:03+1 C:03+2] "disabled puppet on all 9 hosts, re-enabled in steps, confirmed complete noop about the actual firewall rules - just one ferm config file g" [puppet] - 10https://gerrit.wikimedia.org/r/1057952 (owner: 10Dzahn) [00:32:47] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10076004 (10Dzahn) - removed both users from Phabricator group WMF-NDA. done. - I have no rights to manage acl*securit...
[00:33:29] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10076008 (10Dzahn) [00:40:21] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10076019 (10Dzahn) [00:40:37] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10076020 (10Dzahn) [01:07:25] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10076074 (10Dzahn) [01:08:09] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.43.0-wmf.19 [core] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1063924 (https://phabricator.wikimedia.org/T366964) [01:08:11] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.43.0-wmf.19 [core] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1063924 (https://phabricator.wikimedia.org/T366964) (owner: 10TrainBranchBot) [01:09:12] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10076077 (10Dzahn) - removed from Gerrit group wmde-mediawiki @KFrancis Hi, the users should be re... 
[01:38:26] (03Merged) 10jenkins-bot: Branch commit for wmf/1.43.0-wmf.19 [core] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1063924 (https://phabricator.wikimedia.org/T366964) (owner: 10TrainBranchBot) [01:48:46] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [01:53:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:53:46] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [01:54:06] jouncebot: nowandnext [01:54:06] No deployments scheduled for the next 0 hour(s) and 5 minute(s) [01:54:06] In 0 hour(s) and 5 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240820T0200) [01:54:21] darn, train [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240820T0200) [02:04:26] RESOLVED: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:05:31] train deploy preventing my 3AM deployments, tragic /s [02:11:21] speaking of deployments, anyone want to 
backport https://phabricator.wikimedia.org/T372444 sooner rather than later? Special:DeletedContributions is kinda important [02:16:18] o_o [02:16:29] (03PS1) 10Samtar: Fix DeletedContributions for user names containing spaces [core] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1063926 (https://phabricator.wikimedia.org/T372444) [02:39:26] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:53:48] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [02:59:26] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240820T0300) [03:01:23] (03PS1) 10TrainBranchBot: testwikis to 1.43.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063927 (https://phabricator.wikimedia.org/T366964) [03:01:25] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.43.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063927 (https://phabricator.wikimedia.org/T366964) (owner: 10TrainBranchBot) [03:02:08] (03Merged) 10jenkins-bot: testwikis to 1.43.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063927 (https://phabricator.wikimedia.org/T366964) (owner: 10TrainBranchBot) [03:02:26] !log mwpresync@deploy1003 Started
scap sync-world: testwikis to 1.43.0-wmf.19 refs T366964 [03:02:29] T366964: 1.43.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T366964 [03:03:25] RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:11:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [03:32:49] FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:48:58] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.43.0-wmf.19 refs T366964 (duration: 46m 32s) [03:49:01] T366964: 1.43.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T366964 [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240820T0400) [04:00:57] !log mwpresync@deploy1003 Pruned MediaWiki: 1.43.0-wmf.16 (duration: 00m 56s) [04:52:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 35 hosts with reason: Primary switchover s1 T372524 [04:52:08] T372524: Switchover s1 master (db1184 -> db1163) - https://phabricator.wikimedia.org/T372524 [04:52:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1163 with weight 0 T372524', diff saved to https://phabricator.wikimedia.org/P67391 and 
previous config saved to /var/cache/conftool/dbconfig/20240820-045212-root.json [04:52:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 35 hosts with reason: Primary switchover s1 T372524 [04:52:47] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1163 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1062782 (https://phabricator.wikimedia.org/T372524) (owner: 10Gerrit maintenance bot) [04:58:05] (03PS1) 10Marostegui: db1184: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1063930 [04:59:47] (03CR) 10Marostegui: [C:03+2] db1184: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1063930 (owner: 10Marostegui) [05:03:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:16:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db1184.eqiad.wmnet with reason: Long schema change [05:16:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db1184.eqiad.wmnet with reason: Long schema change [05:17:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1163 with weight 0 T372524', diff saved to https://phabricator.wikimedia.org/P67392 and previous config saved to /var/cache/conftool/dbconfig/20240820-051726-marostegui.json [05:17:29] T372524: Switchover s1 master (db1184 -> db1163) - https://phabricator.wikimedia.org/T372524 [05:18:11] !log Starting s1 eqiad failover from db1184 to db1163 - T372524 [05:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:18:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s1 eqiad as read-only for maintenance - T372524', diff saved to https://phabricator.wikimedia.org/P67393 and previous config saved to 
/var/cache/conftool/dbconfig/20240820-051821-root.json [05:18:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1163 to s1 primary and set section read-write T372524', diff saved to https://phabricator.wikimedia.org/P67394 and previous config saved to /var/cache/conftool/dbconfig/20240820-051843-marostegui.json [05:19:07] (03PS2) 10Gerrit maintenance bot: wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1062783 (https://phabricator.wikimedia.org/T372524) [05:19:20] (03CR) 10Marostegui: [C:03+2] wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1062783 (https://phabricator.wikimedia.org/T372524) (owner: 10Gerrit maintenance bot) [05:19:21] (03CR) 10Marostegui: [V:03+2 C:03+2] wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1062783 (https://phabricator.wikimedia.org/T372524) (owner: 10Gerrit maintenance bot) [05:19:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1184 T372524', diff saved to https://phabricator.wikimedia.org/P67395 and previous config saved to /var/cache/conftool/dbconfig/20240820-051948-marostegui.json [05:22:23] !log Deploy schema change on s1 eqiad old master db1184 dbmaint T367856 [05:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:26] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [05:46:25] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240820T0600) [06:00:04] marostegui, Amir1, and arnaudb: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover . 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240820T0600). [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:26] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:13:12] (03CR) 10AOkoth: sql_exporter: specify column for metric (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [06:26:29] (03CR) 10Ayounsi: os-updates-report: don't fail on hosts with no roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1063728 (https://phabricator.wikimedia.org/T372728) (owner: 10Ayounsi) [06:27:23] (03PS2) 10Ayounsi: os-updates-report: don't fail on hosts with no roles [puppet] - 10https://gerrit.wikimedia.org/r/1063728 (https://phabricator.wikimedia.org/T372728) [06:32:01] (03CR) 10Ayounsi: [C:03+2] os-updates-report: don't fail on hosts with no roles [puppet] - 10https://gerrit.wikimedia.org/r/1063728 (https://phabricator.wikimedia.org/T372728) (owner: 10Ayounsi) [06:36:17] !log ryankemper@deploy1003 Started deploy [wdqs/wdqs@316bf7f]: deploy to freshly reimaged host [06:36:30] !log 
ryankemper@deploy1003 Finished deploy [wdqs/wdqs@316bf7f]: deploy to freshly reimaged host (duration: 00m 13s) [06:40:02] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T370754, transfer fresh wdqs-main journal) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs1022.eqiad.wmnet w/ force delete existing files, repooling neither afterwards [06:40:05] T370754: Import WDQS subgraphs to production nodes - https://phabricator.wikimedia.org/T370754 [06:42:54] !log ryankemper@deploy1003 Started deploy [wdqs/wdqs@316bf7f]: deploy to freshly reimaged host [06:43:00] !log ryankemper@deploy1003 Finished deploy [wdqs/wdqs@316bf7f]: deploy to freshly reimaged host (duration: 00m 05s) [06:43:40] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 18:00:00 on wdqs[2021-2023,2025].codfw.wmnet with reason: T364368 non-prod hosts [06:43:41] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 18:00:00 on wdqs[2021-2023,2025].codfw.wmnet with reason: T364368 non-prod hosts [06:43:43] T364368: Create separate pybal pools for wdqs graph split (main vs scholarly) - https://phabricator.wikimedia.org/T364368 [06:47:25] !log ayounsi@cumin1002 START - Cookbook sre.deploy.python-code netbox to netbox-dev2003.codfw.wmnet with reason: Update Netbox-next wheels - ayounsi@cumin1002 - T371890 [06:47:28] T371890: pynetbox incompatibility with Netbox >= 4.0.6 - https://phabricator.wikimedia.org/T371890 [06:48:48] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2003.codfw.wmnet with reason: Update Netbox-next wheels - ayounsi@cumin1002 - T371890 [06:50:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2023:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [06:51:08] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2022:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [06:51:19] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2025:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [06:57:01] (03PS3) 10Slyngshede: Permission approval/rejection [software/bitu] - 10https://gerrit.wikimedia.org/r/1058112 [06:59:59] (03PS1) 10Slyngshede: Token list, fix broken template. [software/bitu] - 10https://gerrit.wikimedia.org/r/1063937 [07:00:05] Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240820T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:02:49] (03PS1) 10Klausman: helmfile.d/ml-services: Switch enqiiki-draftquality to multiprocessing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063936 (https://phabricator.wikimedia.org/T363336) [07:03:10] (03CR) 10Slyngshede: [C:03+2] Token list, fix broken template. [software/bitu] - 10https://gerrit.wikimedia.org/r/1063937 (owner: 10Slyngshede) [07:05:36] (03Merged) 10jenkins-bot: Token list, fix broken template. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1063937 (owner: 10Slyngshede) [07:05:43] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:05:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2022:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:11:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [07:14:53] !log ayounsi@cumin1002 START - Cookbook sre.deploy.python-code netbox to netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: Update Netbox wheels - ayounsi@cumin1002 - T371890 [07:14:56] T371890: pynetbox incompatibility with Netbox >= 4.0.6 - https://phabricator.wikimedia.org/T371890 [07:18:17] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox2003.codfw.wmnet,netbox1003.eqiad.wmnet with reason: Update Netbox wheels - ayounsi@cumin1002 - T371890 [07:19:57] (03PS1) 10Slyngshede: P:idm Enable 2FA management for testing [puppet] - 10https://gerrit.wikimedia.org/r/1063939 [07:20:25] (03CR) 10Ayounsi: [C:03+2] "Tested manually and works as expected." 
[puppet] - 10https://gerrit.wikimedia.org/r/1062358 (https://phabricator.wikimedia.org/T371890) (owner: 10Ayounsi) [07:20:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2023:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:21:59] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3692/co" [puppet] - 10https://gerrit.wikimedia.org/r/1063939 (owner: 10Slyngshede) [07:23:08] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3693/co" [puppet] - 10https://gerrit.wikimedia.org/r/1063939 (owner: 10Slyngshede) [07:25:24] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T370754, transfer fresh wdqs-main journal) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs1022.eqiad.wmnet w/ force delete existing files, repooling neither afterwards [07:25:25] FIRING: [4x] SystemdUnitFailed: wdqs-blazegraph.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:25:28] T370754: Import WDQS subgraphs to production nodes - https://phabricator.wikimedia.org/T370754 [07:27:34] RESOLVED: [4x] ProbeDown: Service wdqs1021:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:30:25] RESOLVED: [4x] SystemdUnitFailed: 
wdqs-blazegraph.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:31:50] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:idm Enable 2FA management for testing [puppet] - 10https://gerrit.wikimedia.org/r/1063939 (owner: 10Slyngshede) [07:45:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2025:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:50:11] (03CR) 10Dreamy Jazz: [C:03+1] Fix DeletedContributions for user names containing spaces [core] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1063926 (https://phabricator.wikimedia.org/T372444) (owner: 10Samtar) [07:50:54] jouncebot: nowandnext [07:50:54] For the next 0 hour(s) and 9 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240820T0700) [07:50:54] In 0 hour(s) and 9 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240820T0800) [07:51:23] (03CR) 10Brouberol: [C:03+2] cloudnative-pg-cluster: enable ingress from the join pod to PG/5432 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063808 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol) [07:54:33] (03CR) 10Brouberol: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1063848 (https://phabricator.wikimedia.org/T368760) (owner: 10Stevemunene) [07:56:16] (03CR) 10Filippo Giunchedi: "LGTM, and +Cole" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063913 (https://phabricator.wikimedia.org/T365265) (owner: 10Krinkle) [07:57:59] (03CR) 10Kevin Bazira: [C:03+1] helmfile.d/ml-services: Switch enqiiki-draftquality to multiprocessing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063936 (https://phabricator.wikimedia.org/T363336) (owner: 10Klausman) [07:58:37] (03CR) 10Klausman: [C:03+2] helmfile.d/ml-services: Switch enqiiki-draftquality to multiprocessing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063936 (https://phabricator.wikimedia.org/T363336) (owner: 10Klausman) [07:59:49] (03Merged) 10jenkins-bot: helmfile.d/ml-services: Switch enqiiki-draftquality to multiprocessing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063936 (https://phabricator.wikimedia.org/T363336) (owner: 10Klausman) [08:00:05] andre and jeena: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240820T0800). 
[08:00:06] o/ [08:02:13] I will now start promoting group0 wikis to 1.43.0-wmf.19 [08:03:04] (03PS1) 10TrainBranchBot: group0 to 1.43.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063976 (https://phabricator.wikimedia.org/T366964) [08:03:05] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.43.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063976 (https://phabricator.wikimedia.org/T366964) (owner: 10TrainBranchBot) [08:03:48] (03Merged) 10jenkins-bot: group0 to 1.43.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063976 (https://phabricator.wikimedia.org/T366964) (owner: 10TrainBranchBot) [08:04:07] !log klausman@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [08:07:09] (03PS1) 10Slyngshede: Permissions: Management command for closing expired requests. [software/bitu] - 10https://gerrit.wikimedia.org/r/1063977 [08:07:40] (03PS3) 10Tiziano Fogli: opensearch: unreach port and shards alerts [alerts] - 10https://gerrit.wikimedia.org/r/1062708 (https://phabricator.wikimedia.org/T371083) [08:15:04] !log klausman@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
[08:15:10] !log aklapper@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.43.0-wmf.19 refs T366964 [08:15:13] T366964: 1.43.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T366964 [08:19:43] (03CR) 10Tiziano Fogli: opensearch: unreach port and shards alerts (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1062708 (https://phabricator.wikimedia.org/T371083) (owner: 10Tiziano Fogli) [08:20:01] (03CR) 10Filippo Giunchedi: Create corto deployment/configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1060516 (https://phabricator.wikimedia.org/T370789) (owner: 10BCornwall) [08:26:53] (03CR) 10Filippo Giunchedi: [C:04-1] "Thank you, what I meant is an host where the puppet agent ran and applied the changes, not only catalog compilation" [puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [08:29:06] (03PS1) 10Ayounsi: Netbox reports: add script ID [puppet] - 10https://gerrit.wikimedia.org/r/1063982 [08:29:25] FIRING: SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:31:07] (03CR) 10Filippo Giunchedi: alert: Ensure alert1002 is the active alert host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1063075 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [08:32:58] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1063982 (owner: 10Ayounsi) [08:34:25] FIRING: [2x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:35:14] (03PS2) 10Ayounsi: Netbox reports: add script ID [puppet] - 
10https://gerrit.wikimedia.org/r/1063982 [08:36:19] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1063982 (owner: 10Ayounsi) [08:39:25] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:44:25] RESOLVED: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:46:46] (03PS2) 10Slyngshede: Permissions: Management command for closing expired requests. [software/bitu] - 10https://gerrit.wikimedia.org/r/1063977 [08:49:25] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:54:25] RESOLVED: [3x] SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:55:16] (03PS1) 10Tiziano Fogli: icinga: disable shard check logstash cluster [puppet] - 10https://gerrit.wikimedia.org/r/1063986 (https://phabricator.wikimedia.org/T371083) [08:58:52] FIRING: GitLabRunnerTrustedConfigMissing: Trusted gitlab-runner missing config - https://wikitech.wikimedia.org/wiki/GitLab/Runbook#GitLabRunnerTrustedConfigMissing - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabRunnerTrustedConfigMissing [08:59:25] FIRING: [3x] SystemdUnitFailed: systemd-timedated.service on 
wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:03:52] RESOLVED: GitLabRunnerTrustedConfigMissing: Trusted gitlab-runner missing config - https://wikitech.wikimedia.org/wiki/GitLab/Runbook#GitLabRunnerTrustedConfigMissing - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabRunnerTrustedConfigMissing [09:06:35] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:06:38] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:14:07] FIRING: [2x] GitLabRunnerTrustedConfigMissing: Trusted gitlab-runner missing config - https://wikitech.wikimedia.org/wiki/GitLab/Runbook#GitLabRunnerTrustedConfigMissing - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabRunnerTrustedConfigMissing [09:16:40] (03CR) 10Brouberol: [C:03+1] Add radosgw services to the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/1034973 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [09:17:58] (03CR) 10Brouberol: [C:03+1] Disable wiping LVM signatures on cephosd server reimages [puppet] - 10https://gerrit.wikimedia.org/r/1063824 (https://phabricator.wikimedia.org/T372783) (owner: 10Btullis) [09:18:16] (03CR) 10Stevemunene: [C:03+1] Disable wiping LVM signatures on cephosd server reimages [puppet] - 10https://gerrit.wikimedia.org/r/1063824 (https://phabricator.wikimedia.org/T372783) (owner: 10Btullis) [09:19:07] RESOLVED: [2x] GitLabRunnerTrustedConfigMissing: Trusted gitlab-runner missing config - https://wikitech.wikimedia.org/wiki/GitLab/Runbook#GitLabRunnerTrustedConfigMissing - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - 
https://alerts.wikimedia.org/?q=alertname%3DGitLabRunnerTrustedConfigMissing [09:19:25] RESOLVED: SystemdUnitFailed: systemd-timedated.service on wdqs1021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:20:08] (03CR) 10MVernon: [C:03+1] "This worked for cephosd systems :)" [puppet] - 10https://gerrit.wikimedia.org/r/1063824 (https://phabricator.wikimedia.org/T372783) (owner: 10Btullis) [09:21:21] (03CR) 10Btullis: [V:03+1 C:03+2] Add radosgw services to the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/1034973 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [09:23:47] (03PS1) 10Cathal Mooney: Only validate vlan location in provision script when manually selected [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1063989 (https://phabricator.wikimedia.org/T372654) [09:27:08] (03PS2) 10Cathal Mooney: Only validate vlan location in provision script when manually selected [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1063989 (https://phabricator.wikimedia.org/T372654) [09:27:50] (03PS1) 10Ayounsi: profile::netbox::data: add type for Netbox::Device [puppet] - 10https://gerrit.wikimedia.org/r/1063990 (https://phabricator.wikimedia.org/T368513) [09:29:02] (03CR) 10Gmodena: [C:03+1] Remove the webrequest_frontend_rc0 gobblin job [puppet] - 10https://gerrit.wikimedia.org/r/1063820 (https://phabricator.wikimedia.org/T372456) (owner: 10Btullis) [09:31:51] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1063990 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [09:34:05] jouncebot: nowandnext [09:34:05] For the next 0 hour(s) and 25 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240820T0800) [09:34:05] In 0 hour(s) and 25 minute(s): MediaWiki infrastructure (UTC 
mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240820T1000) [09:36:11] !log Deploying calico configuration for codfw row c/d lsw - 1062728 [09:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:58] (03PS2) 10Tiziano Fogli: icinga: disable shard check logstash cluster [puppet] - 10https://gerrit.wikimedia.org/r/1063986 (https://phabricator.wikimedia.org/T371083) [09:37:02] !log cgoubert@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [09:38:51] (03CR) 10Btullis: [C:03+2] Disable wiping LVM signatures on cephosd server reimages [puppet] - 10https://gerrit.wikimedia.org/r/1063824 (https://phabricator.wikimedia.org/T372783) (owner: 10Btullis) [09:39:29] !log cgoubert@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:39:55] !log cgoubert@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [09:40:12] !log cgoubert@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [09:40:41] !log cgoubert@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:41:11] !log cgoubert@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:41:30] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [09:42:06] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [09:42:26] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [09:42:36] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[09:44:54] (03CR) 10Clément Goubert: [C:03+1] "yep, sgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063033 (https://phabricator.wikimedia.org/T368366) (owner: 10Scott French) [09:45:09] (03CR) 10Clément Goubert: [C:03+1] mediawiki: upgrade all statsd exporters to bookworm image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063034 (https://phabricator.wikimedia.org/T368366) (owner: 10Scott French) [09:46:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:57:04] (03CR) 10Filippo Giunchedi: "LGTM, to be merged when the time comes" [puppet] - 10https://gerrit.wikimedia.org/r/1063986 (https://phabricator.wikimedia.org/T371083) (owner: 10Tiziano Fogli) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240820T1000) [10:00:12] (03PS3) 10Tiziano Fogli: icinga: disable shard check logstash cluster [puppet] - 10https://gerrit.wikimedia.org/r/1063986 (https://phabricator.wikimedia.org/T371083) [10:02:01] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1063982 (owner: 10Ayounsi) [10:02:46] (03CR) 10Cathal Mooney: [C:03+1] "My oversight not to remove this when the other was added. +1 thanks." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1062694 (https://phabricator.wikimedia.org/T372461) (owner: 10Ayounsi) [10:05:31] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1063986 (https://phabricator.wikimedia.org/T371083) (owner: 10Tiziano Fogli) [10:24:05] (03CR) 10Stevemunene: [C:03+2] idp-test: Register airflow-test-k8s IDP services [puppet] - 10https://gerrit.wikimedia.org/r/1057799 (https://phabricator.wikimedia.org/T371209) (owner: 10Stevemunene) [10:25:49] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:47:42] (03PS4) 10Tiziano Fogli: icinga: disable shard check logstash cluster [puppet] - 10https://gerrit.wikimedia.org/r/1063986 (https://phabricator.wikimedia.org/T371083) [10:47:57] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1063986 (https://phabricator.wikimedia.org/T371083) (owner: 10Tiziano Fogli) [10:48:54] (03PS3) 10Stevemunene: trafficserver: add airflow-test-k8s discovery record [puppet] - 10https://gerrit.wikimedia.org/r/1063848 (https://phabricator.wikimedia.org/T368760) [11:11:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:22:36] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 (10Clement_Goubert) 03NEW [11:25:26] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10077064 (10Clement_Goubert) p:05Triage→03High 
[11:31:14] jouncebot: nowandnext [11:31:15] No deployments scheduled for the next 0 hour(s) and 28 minute(s) [11:31:15] In 0 hour(s) and 28 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240820T1200) [11:31:21] Going to a deploy [11:31:45] (03PS1) 10Dreamy Jazz: Allow ContributionsSpecialPage to accept usemodwiki IP addresses [core] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1064002 (https://phabricator.wikimedia.org/T370413) [11:31:57] (03PS1) 10Dreamy Jazz: Allow ContributionsSpecialPage to accept usemodwiki IP addresses [core] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1064004 (https://phabricator.wikimedia.org/T370413) [11:33:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [core] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1063926 (https://phabricator.wikimedia.org/T372444) (owner: 10Samtar) [11:33:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [core] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1064004 (https://phabricator.wikimedia.org/T370413) (owner: 10Dreamy Jazz) [11:33:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [core] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1064002 (https://phabricator.wikimedia.org/T370413) (owner: 10Dreamy Jazz) [11:35:48] Ah thank you Dreamy_Jazz [11:35:56] Np [11:41:51] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10077137 (10Clement_Goubert) From what I can gather the automation is there with the `--move-vlan` option to the reimage cookbook, I th... 
[11:52:52] (03PS1) 10Kevin Bazira: httpbb: add post deployment tests for the logo-detection endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1064005 (https://phabricator.wikimedia.org/T370757) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240820T1200) [12:06:23] (03Merged) 10jenkins-bot: Fix DeletedContributions for user names containing spaces [core] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1063926 (https://phabricator.wikimedia.org/T372444) (owner: 10Samtar) [12:07:29] (03Merged) 10jenkins-bot: Allow ContributionsSpecialPage to accept usemodwiki IP addresses [core] (wmf/1.43.0-wmf.18) - 10https://gerrit.wikimedia.org/r/1064004 (https://phabricator.wikimedia.org/T370413) (owner: 10Dreamy Jazz) [12:07:34] (03Merged) 10jenkins-bot: Allow ContributionsSpecialPage to accept usemodwiki IP addresses [core] (wmf/1.43.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1064002 (https://phabricator.wikimedia.org/T370413) (owner: 10Dreamy Jazz) [12:08:08] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1063926|Fix DeletedContributions for user names containing spaces (T372444)]], [[gerrit:1064004|Allow ContributionsSpecialPage to accept usemodwiki IP addresses (T370413)]], [[gerrit:1064002|Allow ContributionsSpecialPage to accept usemodwiki IP addresses (T370413)]] [12:08:29] T372444: Deleted contributions not showing for users with spaces in their user name - https://phabricator.wikimedia.org/T372444 [12:08:29] T370413: Contributions are not shown for usemod-style IP address octets that end with xxx - https://phabricator.wikimedia.org/T370413 [12:12:09] !log dreamyjazz@deploy1003 dreamyjazz, samtar: Backport for [[gerrit:1063926|Fix DeletedContributions for user names containing spaces (T372444)]], [[gerrit:1064004|Allow ContributionsSpecialPage to accept usemodwiki IP addresses (T370413)]], [[gerrit:1064002|Allow ContributionsSpecialPage to accept usemodwiki IP 
addresses (T370413)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:15:16] !log dreamyjazz@deploy1003 dreamyjazz, samtar: Continuing with sync [12:19:47] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1063926|Fix DeletedContributions for user names containing spaces (T372444)]], [[gerrit:1064004|Allow ContributionsSpecialPage to accept usemodwiki IP addresses (T370413)]], [[gerrit:1064002|Allow ContributionsSpecialPage to accept usemodwiki IP addresses (T370413)]] (duration: 11m 38s) [12:19:52] T372444: Deleted contributions not showing for users with spaces in their user name - https://phabricator.wikimedia.org/T372444 [12:19:52] T370413: Contributions are not shown for usemod-style IP address octets that end with xxx - https://phabricator.wikimedia.org/T370413 [12:20:02] Finished my deploys (cc TheresNoTime ) [12:20:11] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10077212 (10ayounsi) > I need to check that the physical cabling changes are ok before we start Physical cabling is on the new switches... 
[12:20:49] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:22:05] Thanks again :) [12:22:09] (03CR) 10Ayounsi: [C:03+2] Provision script: remove additional IPs allocation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1062694 (https://phabricator.wikimedia.org/T372461) (owner: 10Ayounsi) [12:23:54] (03Merged) 10jenkins-bot: Provision script: remove additional IPs allocation [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1062694 (https://phabricator.wikimedia.org/T372461) (owner: 10Ayounsi) [12:25:11] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:26:14] (03PS1) 10Jgiannelos: changeprop: Enable PCS pregeneration without restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064013 (https://phabricator.wikimedia.org/T319365) [12:26:22] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [12:26:36] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [12:26:59] (03CR) 10Hnowlan: "As far as I understand it, there isn't a case where the apigw will pass the full path like this. 
I had to spend a bit of time refamiliaris" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063805 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman) [12:27:39] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:28:37] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [12:29:04] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [12:29:17] (03CR) 10Ayounsi: [C:03+2] Netbox reports: add script ID [puppet] - 10https://gerrit.wikimedia.org/r/1063982 (owner: 10Ayounsi) [12:30:18] (03CR) 10Klausman: "Is there a way to make the APIGW just not manipulate the URL path, but just forward the whole thing to the backend service? Or put another" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063805 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman) [12:31:59] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:33:16] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:34:53] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:36:22] (03CR) 10Ayounsi: [C:03+1] Only validate vlan location in provision script when manually selected [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1063989 (https://phabricator.wikimedia.org/T372654) (owner: 10Cathal Mooney) [12:37:35] !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . 
[12:37:56] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:37:58] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:38:29] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:40:33] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:43:31] (03CR) 10Filippo Giunchedi: [C:03+1] profile::netbox::data: add type for Netbox::Device [puppet] - 10https://gerrit.wikimedia.org/r/1063990 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [12:45:13] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:47:58] (03CR) 10Ayounsi: [C:03+2] profile::netbox::data: add type for Netbox::Device [puppet] - 10https://gerrit.wikimedia.org/r/1063990 (https://phabricator.wikimedia.org/T368513) (owner: 10Ayounsi) [12:55:06] (03PS1) 10EoghanGaffney: apt-staging: Change gitlab package puller to use paths instead of IDs [puppet] - 10https://gerrit.wikimedia.org/r/1064018 [12:55:34] (03CR) 10EoghanGaffney: "Addressed in I449ba615cb1587cd7b62d3199849db28b808bc01" [puppet] - 10https://gerrit.wikimedia.org/r/1063015 (owner: 10EoghanGaffney) [12:55:54] (03PS1) 10Klausman: ml-services: bump memory for readability isvc in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064019 [12:56:29] (03CR) 10AikoChou: [C:03+1] ml-services: bump memory for readability isvc in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064019 (owner: 10Klausman) [12:56:31] (03PS1) 10Kevin Bazira: httpbb: add post deployment tests for the rec-api endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1064021 (https://phabricator.wikimedia.org/T371465) [12:57:09] (03PS2) 10Klausman: ml-services: bump memory for readability isvc in staging [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1064019 [12:57:33] (03PS3) 10Klausman: ml-services: bump memory for readability isvc in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064019 [12:57:39] FIRING: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-dse&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:57:51] (03CR) 10Klausman: [C:03+2] ml-services: bump memory for readability isvc in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064019 (owner: 10Klausman) [12:58:48] (03Merged) 10jenkins-bot: ml-services: bump memory for readability isvc in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064019 (owner: 10Klausman) [12:59:14] !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240820T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. 
[13:02:25] (03PS2) 10Jgiannelos: changeprop: Enable PCS pregeneration without restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064013 (https://phabricator.wikimedia.org/T319365) [13:05:10] (03PS3) 10Jgiannelos: changeprop: Enable PCS pregeneration without restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064013 (https://phabricator.wikimedia.org/T319365) [13:05:40] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1014.eqiad.wmnet,service=s7 [13:05:45] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1014.eqiad.wmnet,service=s2 [13:06:12] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1014.eqiad.wmnet with reason: Reimaging clouddb1014 T365424 [13:06:25] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1014.eqiad.wmnet with reason: Reimaging clouddb1014 T365424 [13:06:27] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424 [13:07:39] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (POST pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-dse&var-latency_percentile=0.95&var-verb=POST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:12:18] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [13:12:18] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1299.eqiad.wmnet with OS bullseye [13:12:25] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10077382 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by 
jclark@cumin1002 for host wikikube-worker1299.eqiad.wmnet with OS bullseye... [13:12:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10077383 (10Jclark-ctr) [13:13:27] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064023 [13:13:34] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064024 [13:15:30] !log fnegri@cumin1002 START - Cookbook sre.hosts.reimage for host clouddb1014.eqiad.wmnet with OS bookworm [13:15:49] FIRING: HelmReleaseBadStatus: Helm release airflow-test-k8s/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-test-k8s - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:17:37] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on backup2003 - https://phabricator.wikimedia.org/T372698#10077398 (10Jhancock.wm) a:03Jhancock.wm [13:20:49] RESOLVED: HelmReleaseBadStatus: Helm release airflow-test-k8s/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-test-k8s - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:21:10] (03CR) 10Jelto: [C:03+1] "looks good, thank you. 
A few comments and logs need a update as well" [puppet] - 10https://gerrit.wikimedia.org/r/1064018 (owner: 10EoghanGaffney) [13:27:24] (03PS1) 10Btullis: Add ceph client config data for radosgw clients [puppet] - 10https://gerrit.wikimedia.org/r/1064026 (https://phabricator.wikimedia.org/T330152) [13:28:03] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb1014.eqiad.wmnet with reason: host reimage [13:28:26] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3694/console" [puppet] - 10https://gerrit.wikimedia.org/r/1064026 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [13:30:53] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3695/console" [puppet] - 10https://gerrit.wikimedia.org/r/1064026 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [13:31:40] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb1014.eqiad.wmnet with reason: host reimage [13:32:54] (03CR) 10Btullis: [V:03+1 C:03+2] Add ceph client config data for radosgw clients [puppet] - 10https://gerrit.wikimedia.org/r/1064026 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [13:35:23] (03CR) 10Brouberol: [C:03+1] trafficserver: add airflow-test-k8s discovery record [puppet] - 10https://gerrit.wikimedia.org/r/1063848 (https://phabricator.wikimedia.org/T368760) (owner: 10Stevemunene) [13:35:34] (03PS4) 10Jgiannelos: changeprop: Enable PCS pregeneration without restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064013 (https://phabricator.wikimedia.org/T319365) [13:42:34] (03PS1) 10CDanis: add kitty-terminfo to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/1064028 [13:44:43] (03CR) 10JHathaway: [C:03+1] add kitty-terminfo to standard packages [puppet] - 
10https://gerrit.wikimedia.org/r/1064028 (owner: 10CDanis) [13:46:16] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-07-19-164024 to 2024-08-13-135124 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064031 (https://phabricator.wikimedia.org/T296679) [13:46:31] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-07-23-225548 to 2024-08-16-153209 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064032 (https://phabricator.wikimedia.org/T57876) [13:46:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:48:26] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-07-19-164024 to 2024-08-13-135124 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064031 (https://phabricator.wikimedia.org/T296679) (owner: 10Jforrester) [13:49:20] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-07-19-164024 to 2024-08-13-135124 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064031 (https://phabricator.wikimedia.org/T296679) (owner: 10Jforrester) [13:50:00] (03PS1) 10Arnaudb: mariadb: temporary testing environment [puppet] - 10https://gerrit.wikimedia.org/r/1064033 (https://phabricator.wikimedia.org/T372893) [13:50:10] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [13:51:30] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [13:53:25] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [13:54:05] (03CR) 10Marostegui: [C:04-1] mariadb: temporary testing environment (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1064033 (https://phabricator.wikimedia.org/T372893) (owner: 10Arnaudb) [13:54:56] !log jforrester@deploy1003 
helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [13:54:59] !log jforrester@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [13:55:45] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [13:55:46] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10077596 (10Jclark-ctr) a:03Jclark-ctr [13:56:14] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2024-07-23-225548 to 2024-08-16-153209 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064032 (https://phabricator.wikimedia.org/T57876) (owner: 10Jforrester) [13:57:10] (03CR) 10Ladsgroup: "The idea was to set it up in cloud services See T356053 and T343341" [puppet] - 10https://gerrit.wikimedia.org/r/1064033 (https://phabricator.wikimedia.org/T372893) (owner: 10Arnaudb) [13:57:15] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2024-07-23-225548 to 2024-08-16-153209 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064032 (https://phabricator.wikimedia.org/T57876) (owner: 10Jforrester) [13:58:55] (03PS2) 10Arnaudb: mariadb: temporary testing environment [puppet] - 10https://gerrit.wikimedia.org/r/1064033 (https://phabricator.wikimedia.org/T372893) [13:59:01] (03CR) 10Arnaudb: mariadb: temporary testing environment (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1064033 (https://phabricator.wikimedia.org/T372893) (owner: 10Arnaudb) [13:59:20] (03CR) 10AOkoth: "I'm relying on a simple two node setup on Vagrant locally just to check if the generated YAML is okay." 
[puppet] - 10https://gerrit.wikimedia.org/r/1063766 (https://phabricator.wikimedia.org/T310822) (owner: 10AOkoth) [13:59:27] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddb1014.eqiad.wmnet with OS bookworm [13:59:30] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [13:59:48] (03CR) 10CDanis: [C:03+2] add kitty-terminfo to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/1064028 (owner: 10CDanis) [13:59:53] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1014.eqiad.wmnet,service=s2 [13:59:58] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1014.eqiad.wmnet,service=s7 [14:02:28] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:02:44] !log mforns@deploy1003 Started deploy [airflow-dags/analytics@c202679]: (no justification provided) [14:02:46] (03CR) 10Marostegui: mariadb: temporary testing environment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064033 (https://phabricator.wikimedia.org/T372893) (owner: 10Arnaudb) [14:02:53] !log jforrester@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:03:36] !log mforns@deploy1003 Finished deploy [airflow-dags/analytics@c202679]: (no justification provided) (duration: 00m 51s) [14:04:32] (03CR) 10Marostegui: [C:04-1] "You should probably describe what your plan is to use these hosts, as it may play a role on this patch" [puppet] - 10https://gerrit.wikimedia.org/r/1064033 (https://phabricator.wikimedia.org/T372893) (owner: 10Arnaudb) [14:04:48] (03CR) 10Arnaudb: mariadb: temporary testing environment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064033 (https://phabricator.wikimedia.org/T372893) (owner: 10Arnaudb) [14:05:32] !log jforrester@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:05:35] !log jforrester@deploy1003 helmfile [eqiad] 
START helmfile.d/services/wikifunctions: apply [14:05:42] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1013.eqiad.wmnet,service=s1 [14:05:44] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1013.eqiad.wmnet,service=s3 [14:05:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:05:51] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-presto1016.eqiad.wmnet with OS bookworm [14:05:56] (03CR) 10Marostegui: [C:04-1] mariadb: temporary testing environment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064033 (https://phabricator.wikimedia.org/T372893) (owner: 10Arnaudb) [14:05:58] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10077652 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-presto1016.eqiad.wmnet with OS bookworm [14:06:12] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1013.eqiad.wmnet with reason: Reimaging clouddb1013 T365424 [14:06:25] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1013.eqiad.wmnet with reason: Reimaging clouddb1013 T365424 [14:06:43] !log jforrester@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:07:05] (03CR) 10Hnowlan: "It's a hack, so please comment to explain it if you go with it, but in the same setup, setting `full_path_trim: "/"` and requesting `curl " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063805 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman) [14:08:41] (03PS1) 10Jforrester: wikifunctions: Upgrade staging evaluators from 2024-08-16-153209 to 2024-08-16-153209 with new WASM pool 
code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064035 (https://phabricator.wikimedia.org/T371837) [14:09:05] (03PS1) 10Clément Goubert: kubernetes: Rename mw2291 to wikikube-worker2040 [puppet] - 10https://gerrit.wikimedia.org/r/1064036 (https://phabricator.wikimedia.org/T372878) [14:10:22] !log fnegri@cumin1002 START - Cookbook sre.hosts.reimage for host clouddb1013.eqiad.wmnet with OS bookworm [14:10:24] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-presto1017.eqiad.wmnet with OS bookworm [14:10:25] (03CR) 10Clément Goubert: [C:04-1] kubernetes: Rename mw2291 to wikikube-worker2040 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064036 (https://phabricator.wikimedia.org/T372878) (owner: 10Clément Goubert) [14:10:31] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10077677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-presto1017.eqiad.wmnet with OS bookworm [14:10:38] (03PS3) 10Klausman: services/apigw: drop prefix trim for recommendation-api [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063805 (https://phabricator.wikimedia.org/T347263) [14:10:51] (03PS5) 10Jgiannelos: changeprop: Enable PCS pregeneration without restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064013 (https://phabricator.wikimedia.org/T319365) [14:11:09] (03CR) 10Klausman: "Thank you! I've added a comment that (hopefully) covers the scope and intent of the hack." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1063805 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman) [14:13:01] (03CR) 10EoghanGaffney: apt-staging: Change gitlab package puller to use paths instead of IDs (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1064018 (owner: 10EoghanGaffney) [14:13:07] (03PS2) 10EoghanGaffney: apt-staging: Change gitlab package puller to use paths instead of IDs [puppet] - 10https://gerrit.wikimedia.org/r/1064018 [14:13:19] (03CR) 10Arnaudb: "I'll replicate s1 db on pc1017 and pc2017 and then run the dc switchover cookbook operations between the two" [puppet] - 10https://gerrit.wikimedia.org/r/1064033 (https://phabricator.wikimedia.org/T372893) (owner: 10Arnaudb) [14:13:57] (03PS6) 10Jgiannelos: changeprop: Enable PCS pregeneration without restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064013 (https://phabricator.wikimedia.org/T319365) [14:14:30] (03CR) 10Hnowlan: [C:03+1] "Just to note - if it's decided to go with this approach for all services, we could just add this behaviour as a boolean and deprecate usin" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063805 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman) [14:14:33] (03CR) 10Klausman: httpbb: add post deployment tests for the logo-detection endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064005 (https://phabricator.wikimedia.org/T370757) (owner: 10Kevin Bazira) [14:14:40] (03PS7) 10Jgiannelos: changeprop: Enable PCS pregeneration without restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064013 (https://phabricator.wikimedia.org/T319365) [14:16:10] hnowlan: Did you set thumbor's log_level to debug in deployment-charts with a local root git over-ride? Could you remove so I can deploy? 
:-) [14:16:13] (03PS2) 10Clément Goubert: kubernetes: Rename mw2291 to wikikube-worker2040 [puppet] - 10https://gerrit.wikimedia.org/r/1064036 (https://phabricator.wikimedia.org/T372878) [14:16:44] (03CR) 10Clément Goubert: kubernetes: Rename mw2291 to wikikube-worker2040 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1064036 (https://phabricator.wikimedia.org/T372878) (owner: 10Clément Goubert) [14:16:47] (03PS1) 10Brouberol: airflow: automatically connect to PGBouncer instead of PG itself [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064039 (https://phabricator.wikimedia.org/T372286) [14:17:16] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdc) failed on ms-be1058 - https://phabricator.wikimedia.org/T372207#10077717 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF As of right now, since there are no replacements. I will be closing this ticket. If a replacement is needed, feel... [14:18:03] James_F: I did, apologies. reverted [14:18:08] Thanks! [14:18:17] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade staging evaluators from 2024-08-16-153209 to 2024-08-16-153209 with new WASM pool code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064035 (https://phabricator.wikimedia.org/T371837) (owner: 10Jforrester) [14:19:15] (03Merged) 10jenkins-bot: wikifunctions: Upgrade staging evaluators from 2024-08-16-153209 to 2024-08-16-153209 with new WASM pool code [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064035 (https://phabricator.wikimedia.org/T371837) (owner: 10Jforrester) [14:22:19] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:22:37] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:22:42] (03CR) 10Cathal Mooney: [C:03+2] Only validate vlan location in provision script when manually selected [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1063989 (https://phabricator.wikimedia.org/T372654) (owner: 10Cathal 
Mooney) [14:22:49] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb1013.eqiad.wmnet with reason: host reimage [14:22:49] (03CR) 10Marostegui: [C:04-1] "I'm not sure if that would work. have you checked the cookbook code? also S1 is enwiki not wikitech." [puppet] - 10https://gerrit.wikimedia.org/r/1064033 (https://phabricator.wikimedia.org/T372893) (owner: 10Arnaudb) [14:23:11] (03CR) 10Klausman: [C:03+2] services/apigw: drop prefix trim for recommendation-api [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063805 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman) [14:24:23] (03Merged) 10jenkins-bot: services/apigw: drop prefix trim for recommendation-api [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063805 (https://phabricator.wikimedia.org/T347263) (owner: 10Klausman) [14:24:33] (03CR) 10Bking: [C:03+1] airflow: automatically connect to PGBouncer instead of PG itself [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064039 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol) [14:24:58] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb1013.eqiad.wmnet with reason: host reimage [14:24:59] (03Merged) 10jenkins-bot: Only validate vlan location in provision script when manually selected [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1063989 (https://phabricator.wikimedia.org/T372654) (owner: 10Cathal Mooney) [14:25:09] (03CR) 10Hnowlan: [C:03+1] "lgtm mostly - we could/should clean up the old hostnames in preseed.yaml also." 
[puppet] - 10https://gerrit.wikimedia.org/r/1064036 (https://phabricator.wikimedia.org/T372878) (owner: 10Clément Goubert) [14:26:35] (03CR) 10Brouberol: [C:03+2] airflow: automatically connect to PGBouncer instead of PG itself [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064039 (https://phabricator.wikimedia.org/T372286) (owner: 10Brouberol) [14:26:39] (03PS3) 10Clément Goubert: kubernetes: Rename mw2291 to wikikube-worker2040 [puppet] - 10https://gerrit.wikimedia.org/r/1064036 (https://phabricator.wikimedia.org/T372878) [14:26:51] (03PS1) 10Btullis: Enable the correct service name for radosgw [puppet] - 10https://gerrit.wikimedia.org/r/1064041 (https://phabricator.wikimedia.org/T330152) [14:27:59] (03CR) 10Hnowlan: [C:03+1] kubernetes: Rename mw2291 to wikikube-worker2040 [puppet] - 10https://gerrit.wikimedia.org/r/1064036 (https://phabricator.wikimedia.org/T372878) (owner: 10Clément Goubert) [14:28:03] hnowlan: All clear now if you need it back. [14:28:31] (03PS2) 10Btullis: Enable the correct service name for radosgw [puppet] - 10https://gerrit.wikimedia.org/r/1064041 (https://phabricator.wikimedia.org/T330152) [14:28:36] James_F: thanks, should be fine [14:29:17] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3696/co" [puppet] - 10https://gerrit.wikimedia.org/r/1064041 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [14:29:46] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:30:12] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [14:31:24] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr1-eqiad: disk failure - https://phabricator.wikimedia.org/T372781#10077810 (10VRiley-WMF) @ayounsi I've checked the device and there doesn't seem to be any failure notifications (Physically anyway). 
Would it be possible to open up a RMA or Su... [14:31:44] !log klausman@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: apply [14:32:07] !log klausman@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [14:33:30] (03CR) 10Ladsgroup: "are the new hosts set up? e.g. T369661 says RAID is not set up yet? (I might be missing something). On top, we need pc expansion in rotati" [puppet] - 10https://gerrit.wikimedia.org/r/1064033 (https://phabricator.wikimedia.org/T372893) (owner: 10Arnaudb) [14:36:11] (03CR) 10Jelto: [C:03+1] "lgtm now" [puppet] - 10https://gerrit.wikimedia.org/r/1064018 (owner: 10EoghanGaffney) [14:36:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:36:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 23.06% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:37:22] !log klausman@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: apply [14:37:29] (03PS1) 10Ssingh: sre.dns.admin: add guardrails for depool of sites/resources [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 [14:37:45] !log klausman@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [14:38:10] !log klausman@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . 
[14:38:11] Hmm [14:38:12] (03CR) 10Brennen Bearnes: [C:03+2] Update translations from translatewiki [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1063306 (owner: 10Pppery) [14:38:32] (03CR) 10Btullis: [V:03+1 C:03+2] Enable the correct service name for radosgw [puppet] - 10https://gerrit.wikimedia.org/r/1064041 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [14:38:38] (03CR) 10Brennen Bearnes: [V:03+2 C:03+2] Update translations from translatewiki [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1063306 (owner: 10Pppery) [14:38:46] (03CR) 10Brouberol: "I think this one can be abandoned to the profit of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1063848" [puppet] - 10https://gerrit.wikimedia.org/r/1057830 (https://phabricator.wikimedia.org/T371210) (owner: 10Stevemunene) [14:38:49] Bunch of RuntimeException: Could not acquire lock for page ID [14:39:26] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:41:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:41:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 22.63% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:42:19] (03PS2) 10Ssingh: sre.dns.admin: add guardrails for depool of sites/resources [cookbooks] - 10https://gerrit.wikimedia.org/r/1064042 [14:42:49] !log 
klausman@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [14:43:10] !log klausman@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [14:43:25] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr1-eqiad: disk failure - https://phabricator.wikimedia.org/T372781#10077862 (10cmooney) >>! In T372781#10077810, @VRiley-WMF wrote: > @ayounsi I've checked the device and there doesn't seem to be any failure notifications (Physically anyway).... [14:44:49] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 10netops: cr1-eqiad: disk failure - https://phabricator.wikimedia.org/T372781#10077866 (10VRiley-WMF) Sounds like a plan. Thank you! I will be at the ready. [14:45:30] (03PS1) 10Ssingh: wmflib/constants: add US_DATACENTERS [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1064045 [14:45:42] !log Depooling mw2291.codfw.wmnet for rename and ip renumbering - T372878 [14:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:45] T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878 [14:46:12] ah yes, merging the cookbook BEFORE trying to run it would be a good idea [14:46:36] (03CR) 10Clément Goubert: [C:03+2] sre.k8s: Add pool-depool-node cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1059045 (owner: 10Clément Goubert) [14:47:43] (03CR) 10Stevemunene: [C:03+2] trafficserver: add airflow-test-k8s discovery record [puppet] - 10https://gerrit.wikimedia.org/r/1063848 (https://phabricator.wikimedia.org/T368760) (owner: 10Stevemunene) [14:49:10] (03PS1) 10AikoChou: ml-services: add old readability isvc in staging for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064046 [14:49:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2035.codfw.wmnet with OS bookworm [14:50:09] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: 
Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10077901 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2035.codfw.wmnet... [14:51:45] (03CR) 10CI reject: [V:04-1] wmflib/constants: add US_DATACENTERS [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1064045 (owner: 10Ssingh) [14:53:08] (03CR) 10Klausman: [C:03+1] ml-services: add old readability isvc in staging for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064046 (owner: 10AikoChou) [14:54:04] (03CR) 10AikoChou: [C:03+2] ml-services: add old readability isvc in staging for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064046 (owner: 10AikoChou) [14:55:11] (03Merged) 10jenkins-bot: ml-services: add old readability isvc in staging for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064046 (owner: 10AikoChou) [14:55:55] (03PS2) 10Ssingh: wmflib/constants: add US_DATACENTERS [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1064045 [14:56:27] (03CR) 10Ssingh: "Added a test; CI failure seems unrelated so I will just leave it as it is and come back to it later after the review." [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1064045 (owner: 10Ssingh) [15:00:00] !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . [15:00:05] eoghan, jelto, arnoldokoth, and mutante: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) SRE Collaboration Services office hours deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240820T1500). 
[15:00:44] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:01:31] (03Merged) 10jenkins-bot: sre.k8s: Add pool-depool-node cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1059045 (owner: 10Clément Goubert) [15:02:33] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs2012: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370862#10077948 (10cmooney) 05Open→03Resolved >>! In T370862#10035781, @Papaul wrote: > @cmooney links removed. Yo... [15:02:49] (03CR) 10CI reject: [V:04-1] wmflib/constants: add US_DATACENTERS [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1064045 (owner: 10Ssingh) [15:03:02] 10ops-drmrs: determine cable ID for CRT-008647 - https://phabricator.wikimedia.org/T369951#10077956 (10RobH) CS1873954 opened > Support, > > We need to determine the cable ID on the patch cable installed for CRT-008647 via order ID 1298-1749416. > > We requested this information during installation, and nev... 
[15:03:16] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator/Phorge update [15:03:30] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator/Phorge update [15:03:43] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw2291.codfw.wmnet [15:04:02] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator/Phorge update [15:04:15] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator/Phorge update [15:04:19] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw2291.codfw.wmnet [15:04:35] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab.wmfusercontent.org with reason: Phabricator/Phorge update [15:04:37] (03CR) 10Clément Goubert: [C:03+2] kubernetes: Rename mw2291 to wikikube-worker2040 [puppet] - 10https://gerrit.wikimedia.org/r/1064036 (https://phabricator.wikimedia.org/T372878) (owner: 10Clément Goubert) [15:04:49] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab.wmfusercontent.org with reason: Phabricator/Phorge update [15:04:58] !log brennen@deploy1003 Started deploy [phabricator/deployment@89f5014]: deploy phab2002 for T372898 [15:05:02] T372898: Deploy Phabricator/Phorge 2024-08-20 - https://phabricator.wikimedia.org/T372898 [15:05:27] (03PS1) 10Slyngshede: D:apereo_cas::service Make exposed attributes configurable. 
[puppet] - 10https://gerrit.wikimedia.org/r/1064050 (https://phabricator.wikimedia.org/T369205) [15:05:32] !log brennen@deploy1003 Finished deploy [phabricator/deployment@89f5014]: deploy phab2002 for T372898 (duration: 00m 33s) [15:07:15] 10ops-drmrs, 10ops-eqiad, 06DC-Ops: Clean up old drmrs-eqiad circuit CRT-009240 - https://phabricator.wikimedia.org/T370023#10077973 (10RobH) p:05Triage→03Medium [15:07:16] !log cgoubert@cumin1002 START - Cookbook sre.hosts.rename from mw2291 to wikikube-worker2040 [15:07:19] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3697/console" [puppet] - 10https://gerrit.wikimedia.org/r/1064050 (https://phabricator.wikimedia.org/T369205) (owner: 10Slyngshede) [15:07:44] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [15:08:12] wtf is sidekiq anyway [15:09:15] hnowlan: something gitlab [15:09:17] sidekiq is a gitlab component [15:09:17] !log brennen@deploy1003 Started deploy [phabricator/deployment@89f5014]: deploy phab2002 for T372898 (test redux) [15:09:27] the background job processor according to my very cursory search [15:09:31] https://en.wikipedia.org/wiki/Sidekiq ;) [15:10:39] !log brennen@deploy1003 Finished deploy [phabricator/deployment@89f5014]: deploy phab2002 for T372898 (test redux) (duration: 01m 22s) [15:10:42] T372898: Deploy Phabricator/Phorge 2024-08-20 - https://phabricator.wikimedia.org/T372898 [15:11:19] !log deploy pfw policy update 1724083328 - T372792 [15:11:20] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2291 to wikikube-worker2040 - cgoubert@cumin1002" [15:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - 
https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:11:46] !log brennen@deploy1003 Started deploy [phabricator/deployment@89f5014]: deploy phab1004 for T372898 [15:12:17] !log brennen@deploy1003 Finished deploy [phabricator/deployment@89f5014]: deploy phab1004 for T372898 (duration: 00m 31s) [15:12:44] (03PS3) 10Krinkle: CommonSettings: Rename unregistered wgStatsHost to local "statsHost" var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063913 (https://phabricator.wikimedia.org/T365265) [15:12:53] (03CR) 10Krinkle: CommonSettings: Rename unregistered wgStatsHost to local "statsHost" var (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063913 (https://phabricator.wikimedia.org/T365265) (owner: 10Krinkle) [15:13:08] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:13:32] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2291 to wikikube-worker2040 - cgoubert@cumin1002" [15:13:32] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:13:33] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2040 [15:13:51] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2040 [15:14:30] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2291 to wikikube-worker2040 [15:14:39] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10078004 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by cgoubert@cumin1002 from mw2291 to... 
[15:15:07] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2040.codfw.wmnet with OS bullseye [15:15:17] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10078010 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... [15:15:17] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host [15:15:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:17:00] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.move-vlan (exit_code=99) for host [15:17:01] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2040.codfw.wmnet with OS bullseye [15:17:14] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10078014 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... 
[15:21:09] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [15:21:22] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [15:22:36] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [15:23:03] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [15:23:11] (03PS1) 10Dzahn: gerrit: re-enable throttling over 1000 packets per minute [puppet] - 10https://gerrit.wikimedia.org/r/1064052 (https://phabricator.wikimedia.org/T365259) [15:24:06] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddb1013.eqiad.wmnet with OS bookworm [15:24:10] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1064052 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [15:24:28] (03CR) 10Dzahn: [C:03+2] gerrit: re-enable throttling over 1000 packets per minute [puppet] - 10https://gerrit.wikimedia.org/r/1064052 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [15:24:35] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1013.eqiad.wmnet,service=s3 [15:24:39] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1013.eqiad.wmnet,service=s1 [15:26:06] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1016.eqiad.wmnet with OS bookworm [15:26:20] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10078067 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-presto1016.eqiad.wmnet with OS bookworm executed with errors: - an... 
[15:26:58] 10ops-drmrs: determine cable ID for CRT-008647 - https://phabricator.wikimedia.org/T369951#10078071 (10RobH) They'll be checking the cable ID for us on 22nd (have to do 24 hour + notice or pay expedite fee) [15:28:23] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2040.codfw.wmnet with OS bullseye [15:28:32] !log cgoubert@cumin1002 START - Cookbook sre.hosts.move-vlan for host [15:28:38] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10078072 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w... [15:28:44] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [15:34:32] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Drop PSON support - https://phabricator.wikimedia.org/T372667#10078095 (10jhathaway) [15:34:40] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for HCoplin - https://phabricator.wikimedia.org/T372907 (10HCoplin-WMF) 03NEW [15:34:48] !log cgoubert@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2040 - cgoubert@cumin1002" [15:34:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2040 - cgoubert@cumin1002" [15:34:53] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:34:53] !log cgoubert@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2040.codfw.wmnet 161.0.192.10.in-addr.arpa 1.6.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:34:56] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2040.codfw.wmnet 
161.0.192.10.in-addr.arpa 1.6.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [15:34:57] !log cgoubert@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2040 [15:36:33] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2040 [15:36:33] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host [15:45:52] (03PS1) 10Santiago Faci: Metrics Platform Instrument Configuration: Enabling Action API communication [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064056 [15:46:38] (03CR) 10Clare Ming: [C:03+2] Metrics Platform Instrument Configuration: Enabling Action API communication [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064056 (owner: 10Santiago Faci) [15:47:33] (03Merged) 10jenkins-bot: Metrics Platform Instrument Configuration: Enabling Action API communication [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064056 (owner: 10Santiago Faci) [15:49:56] (03PS14) 10CDobbins: prometheus: add script to check TCP MSS clamping value [puppet] - 10https://gerrit.wikimedia.org/r/1062457 (https://phabricator.wikimedia.org/T367204) [15:51:36] (03CR) 10Ssingh: [C:03+1] "Looks good, compared against Netbox. 
(This probably needs a rebase if we look at interfaces.yaml since our warning there that we recently " [puppet] - 10https://gerrit.wikimedia.org/r/1056550 (https://phabricator.wikimedia.org/T370897) (owner: 10Cathal Mooney) [15:51:43] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [15:51:56] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [15:52:50] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [15:52:52] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [15:53:03] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2040.codfw.wmnet with reason: host reimage [15:55:31] (03PS1) 10Aaron Schulz: Set monolog level for DeferredUpdates to warning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064057 [15:56:24] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2040.codfw.wmnet with reason: host reimage [15:57:15] (03PS1) 10Santiago Faci: Metrics Platform Instrument Configurator: Enabling MW API as a listener [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064058 [15:58:14] (03CR) 10Clare Ming: [C:03+2] Metrics Platform Instrument Configurator: Enabling MW API as a listener [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064058 (owner: 10Santiago Faci) [15:59:10] (03Merged) 10jenkins-bot: Metrics Platform Instrument Configurator: Enabling MW API as a listener [deployment-charts] - 10https://gerrit.wikimedia.org/r/1064058 (owner: 10Santiago Faci) [15:59:19] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:00:05] jhathaway and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240820T1600). 
[16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:02:08] (03PS6) 10Cathal Mooney: lvs2014: move A and B vlans to primary link and add new C and D vlans [puppet] - 10https://gerrit.wikimedia.org/r/1056550 (https://phabricator.wikimedia.org/T370897) [16:05:05] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [16:05:22] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [16:07:41] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding logging-sd2004 to codfw - jhancock@cumin2002" [16:08:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding logging-sd2004 to codfw - jhancock@cumin2002" [16:08:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:10:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host logging-sd2004.mgmt.codfw.wmnet with reboot policy FORCED [16:15:55] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2040.codfw.wmnet with OS bullseye [16:16:06] 06SRE, 06Infrastructure-Foundations, 10netops, 06serviceops, 13Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10078336 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik... 
[16:17:26] (03PS1) 10CDanis: admin: /home/cdanis dotfiles tweaks [puppet] - 10https://gerrit.wikimedia.org/r/1064060 [16:17:50] (03CR) 10CDanis: [C:03+2] admin: /home/cdanis dotfiles tweaks [puppet] - 10https://gerrit.wikimedia.org/r/1064060 (owner: 10CDanis) [16:21:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-sd2004.mgmt.codfw.wmnet with reboot policy FORCED [16:22:47] !log beginning work to reimage lvs2014 onto per-rack vlan in codfw rack D2 and move to new switch T370897 [16:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:50] T370897: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897 [16:23:31] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on lsw1-d2-codfw.mgmt with reason: move lvs2014 from asw to lsw [16:23:45] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on lsw1-d2-codfw.mgmt with reason: move lvs2014 from asw to lsw [16:24:00] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on lvs2014.codfw.wmnet with reason: move lvs2014 from asw to lsw [16:24:14] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on lvs2014.codfw.wmnet with reason: move lvs2014 from asw to lsw [16:25:25] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10078401 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=24f68f00-c864-474e-a3e6-c044aab86afa) set by... 
[16:25:37] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10078403 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=612388e5-b8df-408f-81be-6f237cee6e7c) set by... [16:26:36] !log disabling BGP to PyBal on lvs2014 in preparation for move to new switch T370897 [16:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:00] !log LDAP - removed htriedman from wmf group, added htriedman to nda group (T371644) [16:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:03] T371644: Requested offboarding-to-volunteer of HTriedman // Transfer ownership of SpinachBot from HTriedman (WMF) to HTriedman - https://phabricator.wikimedia.org/T371644 [16:28:07] (03PS1) 10Ayounsi: Netbox-hiera: add device role to mgmt_hosts (try 2) [cookbooks] - 10https://gerrit.wikimedia.org/r/1064061 (https://phabricator.wikimedia.org/T368513) [16:28:35] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logging-sd2004'] [16:30:32] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on backup2003 - https://phabricator.wikimedia.org/T372698#10078455 (10Jhancock.wm) I have a drive on hand to replace this one. I can schedule a time any day this week between 8am and 4pm CDT (1300-2100 UTC). 
[16:33:41] (03PS2) 10MusikAnimal: [beta] Enable CodeMirrorRTL on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037658 (https://phabricator.wikimedia.org/T170001) [16:35:58] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['logging-sd2004'] [16:38:43] !log Running homer 'lsw1-a3-codfw*' commit 'T351074' [16:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:56] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [16:39:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host logging-sd2004.codfw.wmnet with OS bookworm [16:39:31] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd200[1-4] - https://phabricator.wikimedia.org/T370545#10078488 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host logging-sd2004.codfw.wmnet with OS bookworm [16:40:44] !log adding vlans to lsw1-d2-codfw for lvs2014 T370897 [16:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:52] T370897: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897 [16:41:30] !log Pooling wikikube-worker2040.codfw.wmnet - T351074 [16:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:36] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2040.codfw.wmnet [16:41:36] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2040.codfw.wmnet [16:43:06] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [16:43:29] (03CR) 10Cathal Mooney: [C:03+2] lvs2014: move A and B vlans to primary link and add new C and D vlans [puppet] - 10https://gerrit.wikimedia.org/r/1056550 (https://phabricator.wikimedia.org/T370897) 
(owner: 10Cathal Mooney) [16:44:46] (03PS2) 10Aaron Schulz: Set monolog level for "DeferredUpdates" to warning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064057 [16:45:28] (03PS3) 10Aaron Schulz: Set monolog level for "DeferredUpdates" to info [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064057 [16:46:21] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T372916 (10Clement_Goubert) 03NEW [16:49:57] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for lvs2014 - cmooney@cumin1002" [16:50:01] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for lvs2014 - cmooney@cumin1002" [16:50:02] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:51:20] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache lvs2013.codfw.wmnet on all recursors [16:51:23] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) lvs2013.codfw.wmnet on all recursors [16:52:31] (03PS4) 10Aaron Schulz: Set monolog level for "DeferredUpdates" to info [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064057 [16:53:03] (03PS5) 10Aaron Schulz: Set monolog level for "DeferredUpdates" to info [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064057 [16:55:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logging-sd2004.codfw.wmnet with reason: host reimage [16:56:12] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host lvs2014.codfw.wmnet with OS bullseye [16:56:22] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10078582 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host lvs2014.cod... [16:57:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logging-sd2004.codfw.wmnet with reason: host reimage [16:58:11] (03CR) 10Scott French: [C:03+2] mw-api-int: pilot bookworm statsd exporter image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063033 (https://phabricator.wikimedia.org/T368366) (owner: 10Scott French) [16:58:15] (03PS1) 10Btullis: Add TLS support to the radosgw services on the DPE ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/1064063 (https://phabricator.wikimedia.org/T330152) [16:58:39] (03CR) 10CI reject: [V:04-1] Add TLS support to the radosgw services on the DPE ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/1064063 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [16:59:11] (03Merged) 10jenkins-bot: mw-api-int: pilot bookworm statsd exporter image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063033 (https://phabricator.wikimedia.org/T368366) (owner: 10Scott French) [17:02:03] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [17:02:25] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [17:06:29] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:06:51] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:14:54] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [17:16:28] !log removing config for ssw1-a8-codfw link to lvs2014 T370897 [17:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:34] T370897: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897 [17:20:18] 
!log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "recover wdqs2024 from failed status T372919 - bking@cumin2002" [17:20:21] T372919: Bring wqds2024 back into service - https://phabricator.wikimedia.org/T372919 [17:23:48] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:24:54] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10078697 (10KFrancis) Thanks, @Dzahn, the spreadsheet has been updated. [17:25:41] (03PS2) 10Btullis: Add TLS support to the radosgw services on the DPE ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/1064063 (https://phabricator.wikimedia.org/T330152) [17:25:46] (03CR) 10Scott French: [C:03+2] mediawiki: upgrade all statsd exporters to bookworm image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063034 (https://phabricator.wikimedia.org/T368366) (owner: 10Scott French) [17:26:46] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T371796#10078703 (10Ottomata) I can approve for `analytics-privatedata-users`. Approved! 
[17:27:19] (03Merged) 10jenkins-bot: mediawiki: upgrade all statsd exporters to bookworm image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1063034 (https://phabricator.wikimedia.org/T368366) (owner: 10Scott French) [17:28:47] (03CR) 10CI reject: [V:04-1] Add TLS support to the radosgw services on the DPE ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/1064063 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [17:30:05] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [17:30:25] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [17:30:26] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [17:30:54] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [17:30:55] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-misc: apply [17:31:07] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [17:31:08] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [17:31:30] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [17:31:31] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:31:53] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:31:55] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [17:32:04] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply [17:32:47] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "recover wdqs2024 from failed status T372919 - bking@cumin2002" [17:32:50] FYI, as was the case yesterday, this is only touching the statsd exporter ^^ [17:32:52] T372919: Bring wqds2024 back into service - https://phabricator.wikimedia.org/T372919 [17:37:07] !log 
bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2024.codfw.wmnet with OS bullseye [17:37:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [17:40:42] (03CR) 10AOkoth: vrts: run install script on new server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1063733 (owner: 10AOkoth) [17:43:12] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [17:43:32] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [17:43:33] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [17:43:34] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2014.codfw.wmnet with reason: host reimage [17:43:51] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [17:43:52] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [17:44:07] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [17:44:08] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [17:44:24] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [17:44:25] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:44:43] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:44:44] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply [17:44:56] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply [17:45:14] !log 
jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1017.eqiad.wmnet with OS bookworm [17:45:23] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10078793 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-presto1017.eqiad.wmnet with OS bookworm executed with errors: - an... [17:46:40] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:46:45] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2014.codfw.wmnet with reason: host reimage [17:47:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [17:50:37] (03PS1) 10Dzahn: nftables_throttling: set a default value for burst parameter [puppet] - 10https://gerrit.wikimedia.org/r/1064065 [17:51:04] !log mediawiki statsd exporter deployments upgraded to bookworm-based image - T368366 [17:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:07] T368366: Upgrade K8s docker images running in Wikimedia production on Buster to either Bullseye or Bookworm - https://phabricator.wikimedia.org/T368366 [18:00:04] andre and jeena: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240820T1800). [18:00:22] jouncebot: nothing to do, sorry [18:03:49] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs2014.codfw.wmnet with OS bullseye [18:04:02] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10078855 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host lvs2014.codfw.w... [18:08:33] (03PS2) 10Andrea Denisse: alert: Add the alert[12]002 hosts to puppet realm [puppet] - 10https://gerrit.wikimedia.org/r/1063063 (https://phabricator.wikimedia.org/T372418) [18:08:33] (03PS4) 10Andrea Denisse: alert: Ensure alert1002 is the active alert host [puppet] - 10https://gerrit.wikimedia.org/r/1063075 (https://phabricator.wikimedia.org/T372418) [18:08:34] (03PS3) 10Andrea Denisse: alert: Remove the alert[12]002 hosts as alertmanagers [puppet] - 10https://gerrit.wikimedia.org/r/1063234 (https://phabricator.wikimedia.org/T372607) [18:09:17] (03CR) 10CI reject: [V:04-1] alert: Remove the alert[12]002 hosts as alertmanagers [puppet] - 10https://gerrit.wikimedia.org/r/1063234 (https://phabricator.wikimedia.org/T372607) (owner: 10Andrea Denisse) [18:12:30] (03PS4) 10Andrea Denisse: alert: Remove the alert[12]002 hosts as alertmanagers [puppet] - 10https://gerrit.wikimedia.org/r/1063234 (https://phabricator.wikimedia.org/T372607) [18:21:34] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 07Wikimedia-production-error: Cannot move Commons File:Dhruve_Sehgal_in_2021.png - https://phabricator.wikimedia.org/T372924#10078896 (10Bugreporter) [18:24:38] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2024.codfw.wmnet with OS bullseye [18:26:07] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-presto1017.eqiad.wmnet with 
OS bookworm [18:26:14] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10078915 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-presto1017.eqiad.wmnet with OS bookworm [18:29:01] (03CR) 10Krinkle: [C:03+1] Set monolog level for "DeferredUpdates" to info [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1064057 (owner: 10Aaron Schulz) [18:42:38] 06SRE, 06Infrastructure-Foundations, 10netops: PuppetDB import failing for lvs2014 - https://phabricator.wikimedia.org/T372931 (10cmooney) 03NEW [18:43:35] (03CR) 10CDanis: "recheck" [software/klaxon] - 10https://gerrit.wikimedia.org/r/830229 (owner: 10CDanis) [18:43:39] (03CR) 10CDanis: "recheck" [software/klaxon] - 10https://gerrit.wikimedia.org/r/830258 (owner: 10CDanis) [18:43:42] (03CR) 10CDanis: "recheck" [software/klaxon] - 10https://gerrit.wikimedia.org/r/830259 (https://phabricator.wikimedia.org/T317159) (owner: 10CDanis) [18:46:04] 06SRE, 06Infrastructure-Foundations, 10netops: PuppetDB import failing for lvs2014 - https://phabricator.wikimedia.org/T372931#10078999 (10ssingh) As another data point, we most certainly have not reimaged any LVS host //after// the Netbox migration was finished. So yeah, it might be related to that. 
[18:51:32] (03CR) 10RLazarus: [C:03+1] refactor value of api_base_url to support reporting API [software/klaxon] - 10https://gerrit.wikimedia.org/r/830229 (owner: 10CDanis) [18:52:35] (03CR) 10CDanis: [C:03+2] "recheck" [software/klaxon] - 10https://gerrit.wikimedia.org/r/830229 (owner: 10CDanis) [18:54:36] (03Merged) 10jenkins-bot: refactor value of api_base_url to support reporting API [software/klaxon] - 10https://gerrit.wikimedia.org/r/830229 (owner: 10CDanis) [18:58:16] (03CR) 10RLazarus: [C:03+1] "Style only, feel free to merge without another round" [software/klaxon] - 10https://gerrit.wikimedia.org/r/830258 (owner: 10CDanis) [19:00:12] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Drop PSON support - https://phabricator.wikimedia.org/T372667#10079020 (10jhathaway) [19:03:00] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Strict mode enabled by default - https://phabricator.wikimedia.org/T372664#10079026 (10jhathaway) [19:09:23] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Puppet 8 readiness - https://phabricator.wikimedia.org/T366900#10079038 (10jhathaway) [19:09:42] (03PS2) 10CDanis: refactor out incident parsing for reuse [software/klaxon] - 10https://gerrit.wikimedia.org/r/830258 [19:09:42] (03PS1) 10CDanis: add py11 [software/klaxon] - 10https://gerrit.wikimedia.org/r/1064073 [19:10:03] (03CR) 10CDanis: [C:03+2] add py11 [software/klaxon] - 10https://gerrit.wikimedia.org/r/1064073 (owner: 10CDanis) [19:10:57] (03CR) 10CDanis: [C:03+2] refactor out incident parsing for reuse (031 comment) [software/klaxon] - 10https://gerrit.wikimedia.org/r/830258 (owner: 10CDanis) [19:11:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:11:49] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 
10LDAP-Access-Requests, 06Security-Team: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10079042 (10Dzahn) [19:11:50] (03CR) 10CI reject: [V:04-1] refactor out incident parsing for reuse [software/klaxon] - 10https://gerrit.wikimedia.org/r/830258 (owner: 10CDanis) [19:12:00] (03Merged) 10jenkins-bot: add py11 [software/klaxon] - 10https://gerrit.wikimedia.org/r/1064073 (owner: 10CDanis) [19:14:04] (03PS3) 10CDanis: refactor out incident parsing for reuse [software/klaxon] - 10https://gerrit.wikimedia.org/r/830258 [19:14:16] (03CR) 10CDanis: [C:03+2] refactor out incident parsing for reuse [software/klaxon] - 10https://gerrit.wikimedia.org/r/830258 (owner: 10CDanis) [19:14:21] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10079043 (10Dzahn) [19:15:45] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10079045 (10Dzahn) The actual LDAP UIDs are **gtzatchkova** (and ALSO: **guergana** but that isn't... 
[19:16:06] !log restarting netbox service on netbox1003 to update script [19:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:34] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10079050 (10Dzahn) [19:22:47] 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10LDAP-Access-Requests, 06Security-Team: Offboard Guergana Tzatchkova (WMDE) and Frederik Ring from WMF systems - https://phabricator.wikimedia.org/T372767#10079066 (10Dzahn) @jhathaway Could you take over from here as part of clinic duty and since you ar... [19:22:51] (03CR) 10Andrea Denisse: alert: Add the alert[12]002 hosts to puppet realm (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1063063 (https://phabricator.wikimedia.org/T372418) (owner: 10Andrea Denisse) [19:24:35] (03PS1) 10Cathal Mooney: Fix puppet import so it doesn't fail if parent prefix has no role [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1064076 (https://phabricator.wikimedia.org/T372931) [19:26:23] (03CR) 10CI reject: [V:04-1] Fix puppet import so it doesn't fail if parent prefix has no role [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1064076 (https://phabricator.wikimedia.org/T372931) (owner: 10Cathal Mooney) [19:29:03] (03PS1) 10CDanis: Drop support for Python <3.11 [software/klaxon] - 10https://gerrit.wikimedia.org/r/1064077 [19:29:55] (03PS1) 10CDanis: victorops cli: COMMAND is required [software/klaxon] - 10https://gerrit.wikimedia.org/r/1064080 [19:30:02] (03CR) 10CDanis: [C:03+2] Drop support for Python <3.11 [software/klaxon] - 10https://gerrit.wikimedia.org/r/1064077 (owner: 10CDanis) [19:30:08] (03CR) 10CDanis: [C:03+2] victorops cli: COMMAND is required [software/klaxon] - 10https://gerrit.wikimedia.org/r/1064080 (owner: 10CDanis) [19:34:09] (03Merged) 10jenkins-bot: Drop support for 
Python <3.11 [software/klaxon] - 10https://gerrit.wikimedia.org/r/1064077 (owner: 10CDanis) [19:34:09] (03Merged) 10jenkins-bot: victorops cli: COMMAND is required [software/klaxon] - 10https://gerrit.wikimedia.org/r/1064080 (owner: 10CDanis) [19:35:29] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2024.codfw.wmnet'] [19:43:46] 06SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T372934 (10GOlson-WMF) 03NEW [19:46:15] (03PS1) 10BryanDavis: gitlab: Allow WMCS runners to use registry.cloud.releng.team [puppet] - 10https://gerrit.wikimedia.org/r/1064084 (https://phabricator.wikimedia.org/T372848) [19:53:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [19:58:39] !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs2024.codfw.wmnet'] [19:58:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240820T2000). Please do the needful. [20:00:05] chlod: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:16] o/ awake and aware [20:00:48] !log depool/restart/repool ms-fe2009 T360913 [20:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:52] T360913: Swift proxy server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913 [20:01:37] !log depool/restart/repool ms-fe2011 T360913 [20:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:04] !log depool/restart/repool ms-fe2012 T360913 [20:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:33] !log depool/restart/repool ms-fe2014 T360913 [20:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:03] !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2024.codfw.wmnet'] [20:05:08] !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs2024.codfw.wmnet'] [20:05:32] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2024.codfw.wmnet with OS bullseye [20:16:04] anyone around to handle the backport window? [20:23:17] (03CR) 10Dzahn: vrts: run install script on new server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1063733 (owner: 10AOkoth) [20:25:24] (03CR) 10Dzahn: "If we still need to support Icinga, there is a way to add "virtual hosts" (service names) to Icinga, using @monitoring::host." [cookbooks] - 10https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: 10Jelto) [20:26:31] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [20:28:44] (03PS1) 10Dzahn: acme_chief: remove outdated gerrit service names [puppet] - 10https://gerrit.wikimedia.org/r/1064091 [20:31:44] hi chlod 👋 - sorry lost track of time - do you still need a deployer? 
[20:32:05] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:32:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logging-sd2004.codfw.wmnet with OS bookworm [20:32:14] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd200[1-4] - https://phabricator.wikimedia.org/T370545#10079255 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host logging-sd2004.codfw.wmnet with OS bookworm completed: - logging-sd... [20:32:48] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:32:56] cjming: yep! [20:33:05] ok - let's go! [20:33:10] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [20:33:25] (03PS2) 10Chlod Alejandro: kawikisource: add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063914 (https://phabricator.wikimedia.org/T368868) [20:33:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:35:04] (03CR) 10Brennen Bearnes: "This seems... Fine I think? Although my limited understanding here (e.g. 
from reading T324361) is that there are no guarantees of long-ter" [puppet] - 10https://gerrit.wikimedia.org/r/1064084 (https://phabricator.wikimedia.org/T372848) (owner: 10BryanDavis) [20:35:05] (03PS1) 10Cathal Mooney: Add new reverse PTR includes for new netbox-generataed reverse zones [dns] - 10https://gerrit.wikimedia.org/r/1064095 (https://phabricator.wikimedia.org/T365651) [20:35:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063914 (https://phabricator.wikimedia.org/T368868) (owner: 10Chlod Alejandro) [20:35:21] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd200[1-4] - https://phabricator.wikimedia.org/T370545#10079264 (10Jhancock.wm) [20:35:30] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:35:47] (03Merged) 10jenkins-bot: kawikisource: add custom logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063914 (https://phabricator.wikimedia.org/T368868) (owner: 10Chlod Alejandro) [20:36:07] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install logging-sd200[1-4] - https://phabricator.wikimedia.org/T370545#10079265 (10Jhancock.wm) 05Open→03Resolved @colewhite all finished here. servers are ready for you. 
[20:36:09] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1063914|kawikisource: add custom logos (T368868)]]
[20:36:12] T368868: Set logos for new wikis - https://phabricator.wikimedia.org/T368868
[20:39:58] !log cjming@deploy1003 cjming, chlod: Backport for [[gerrit:1063914|kawikisource: add custom logos (T368868)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:40:02] chlod: 1st patch on test servers if you'd like to verify - lmk if/when i can sync
[20:40:34] (PS1) Andrea Denisse: alert: Add the alert[12]002 hosts to Prometheus blackbox exporter [puppet] - https://gerrit.wikimedia.org/r/1064097 (https://phabricator.wikimedia.org/T372418)
[20:41:30] youch, looks like this one has issues with the wordmark. is it possible to revert?
[20:41:38] sure
[20:41:57] !log cjming@deploy1003 Sync cancelled.
[20:42:24] other ones shouldn't have issues, it seems to just be this specific wordmark
[20:42:37] (PS1) TrainBranchBot: Revert "kawikisource: add custom logos" [mediawiki-config] - https://gerrit.wikimedia.org/r/1064099
[20:42:38] (CR) TrainBranchBot: "cjming@deploy1003 created a revert of this change as I327c6a9138bd9a7385362cc9187ac77f650275ba" [mediawiki-config] - https://gerrit.wikimedia.org/r/1063914 (https://phabricator.wikimedia.org/T368868) (owner: Chlod Alejandro)
[20:42:59] ok - moving onto the 2nd patch then
[20:43:07] (CR) Cathal Mooney: [C:+2] Add new reverse PTR includes for new netbox-generataed reverse zones [dns] - https://gerrit.wikimedia.org/r/1064095 (https://phabricator.wikimedia.org/T365651) (owner: Cathal Mooney)
[20:43:37] (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1064099 (owner: TrainBranchBot)
[20:43:49] ops-codfw, DC-Ops, Prod-Kubernetes, serviceops, Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T372916#10079278 (Jhancock.wm) FYI, I'm getting an netbox reporting alert on mw2291. test_puppetdb_in_netbox 2024-08-20T20:40:02.169950+00:00 Failure — — expected...
[20:44:50] (Merged) jenkins-bot: Revert "kawikisource: add custom logos" [mediawiki-config] - https://gerrit.wikimedia.org/r/1064099 (owner: TrainBranchBot)
[20:45:09] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1064099|Revert "kawikisource: add custom logos"]]
[20:45:51] (PS2) Chlod Alejandro: kaawiktionary: add custom logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1063915 (https://phabricator.wikimedia.org/T368868)
[20:45:59] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#10079280 (cmooney) @Jhancock.wm I've assigned new IPs for all these hosts on the //private1-c-codfw// and //private1-d-codfw// vlans...
[20:45:59] ops-eqiad, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372939 (phaultfinder) NEW
[20:47:39] (CR) Ssingh: [C:+1] Add new reverse PTR includes for new netbox-generataed reverse zones [dns] - https://gerrit.wikimedia.org/r/1064095 (https://phabricator.wikimedia.org/T365651) (owner: Cathal Mooney)
[20:49:02] !log cjming@deploy1003 cjming, trainbranchbot: Backport for [[gerrit:1064099|Revert "kawikisource: add custom logos"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:49:25] !log cjming@deploy1003 cjming, trainbranchbot: Continuing with sync
[20:52:57] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2024.codfw.wmnet with OS bullseye
[20:53:56] !log imported php-defaults_92+wmf11u1 into component/php81 - T372507
[20:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:00] T372507: Prepare WMF PHP 8.1 packages for Bullseye - https://phabricator.wikimedia.org/T372507
[20:54:02] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1064099|Revert "kawikisource: add custom logos"]] (duration: 08m 53s)
[20:54:13] (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1063915 (https://phabricator.wikimedia.org/T368868) (owner: Chlod Alejandro)
[20:54:55] (Merged) jenkins-bot: kaawiktionary: add custom logos [mediawiki-config] - https://gerrit.wikimedia.org/r/1063915 (https://phabricator.wikimedia.org/T368868) (owner: Chlod Alejandro)
[20:55:14] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1063915|kaawiktionary: add custom logos (T368868)]]
[20:55:17] T368868: Set logos for new wikis - https://phabricator.wikimedia.org/T368868
[20:55:43] SRE, Infrastructure-Foundations, Puppet-Infrastructure: Exclude legacy facts by default - https://phabricator.wikimedia.org/T372666#10079296 (jhathaway)
[20:55:44] (CR) BryanDavis: "Ack on lack of durability guarantees in this registry. My particular use case today is "image exists for the duration of a GitLab CI pipel" [puppet] - https://gerrit.wikimedia.org/r/1064084 (https://phabricator.wikimedia.org/T372848) (owner: BryanDavis)
[20:56:46] (Abandoned) Andrea Denisse: alert: Ensure the alert[12]001 hosts use the spare::system role [puppet] - https://gerrit.wikimedia.org/r/1063231 (https://phabricator.wikimedia.org/T372607) (owner: Andrea Denisse)
[20:56:51] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review: Strict mode enabled by default - https://phabricator.wikimedia.org/T372664#10079298 (jhathaway)
[20:57:35] !log cjming@deploy1003 cjming, chlod: Backport for [[gerrit:1063915|kaawiktionary: add custom logos (T368868)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:57:40] chlod: do you want to test 2nd patch?
[20:57:47] yep, testing now
[20:58:19] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2024.codfw.wmnet with OS bullseye
[21:01:25] this one's got an issue with the logo, it's skewed too far left :( seems like this is the only patch with this specific issue as well.
[21:01:41] hm - so revert?
[21:01:49] yup
[21:01:54] !log cjming@deploy1003 Sync cancelled.
[21:02:24] (PS1) TrainBranchBot: Revert "kaawiktionary: add custom logos" [mediawiki-config] - https://gerrit.wikimedia.org/r/1064106
[21:02:24] (CR) TrainBranchBot: "cjming@deploy1003 created a revert of this change as I654d4150858b602d16f5d53a82e64a653f4947c0" [mediawiki-config] - https://gerrit.wikimedia.org/r/1063915 (https://phabricator.wikimedia.org/T368868) (owner: Chlod Alejandro)
[21:02:46] (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1064106 (owner: TrainBranchBot)
[21:03:33] (Merged) jenkins-bot: Revert "kaawiktionary: add custom logos" [mediawiki-config] - https://gerrit.wikimedia.org/r/1064106 (owner: TrainBranchBot)
[21:03:53] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1064106|Revert "kaawiktionary: add custom logos"]]
[21:04:47] !log imported dh-php_5.4+wmf11u1 into component/php81 - T372507
[21:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:04:50] T372507: Prepare WMF PHP 8.1 packages for Bullseye - https://phabricator.wikimedia.org/T372507
[21:07:12] chlod: i'm happy to try one more -- or given how it's turned out, do you want to verify the rest of the patches ?
[21:07:27] !log cjming@deploy1003 trainbranchbot, cjming: Backport for [[gerrit:1064106|Revert "kaawiktionary: add custom logos"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:07:32] !log cjming@deploy1003 trainbranchbot, cjming: Continuing with sync
[21:09:03] will verify them more, better to be safe. thank you for the effort and sorry for the trouble! :(
[21:10:07] no worries ! i'll make more of an effort to check in at the top of the window - pardon my lateness
[21:12:11] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1064106|Revert "kaawiktionary: add custom logos"]] (duration: 08m 18s)
[21:13:57] !log end of UTC late backport window
[21:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:16:38] (PS1) JHathaway: puppet8: add explicit typecast [puppet] - https://gerrit.wikimedia.org/r/1064108 (https://phabricator.wikimedia.org/T372664)
[21:17:00] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1064108 (https://phabricator.wikimedia.org/T372664) (owner: JHathaway)
[21:19:52] (PS1) Dzahn: gitlab: add replica hosts to Icinga [puppet] - https://gerrit.wikimedia.org/r/1064109 (https://phabricator.wikimedia.org/T363564)
[21:20:08] (CR) Dzahn: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1064109" [cookbooks] - https://gerrit.wikimedia.org/r/1062394 (https://phabricator.wikimedia.org/T363564) (owner: Jelto)
[21:21:07] (CR) Dzahn: ".. if we want them in Icinga.. this should do it." [puppet] - https://gerrit.wikimedia.org/r/1064109 (https://phabricator.wikimedia.org/T363564) (owner: Dzahn)
[21:22:26] SRE-swift-storage, Commons: Cannot move Commons File:Dhruve_Sehgal_in_2021.png - https://phabricator.wikimedia.org/T372924#10079350 (Aklapper)
[21:22:37] SRE, DBA, serviceops: In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes - https://phabricator.wikimedia.org/T372943 (CDanis) NEW
[21:22:54] SRE, DBA, serviceops: In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes - https://phabricator.wikimedia.org/T372943#10079364 (CDanis)
[21:24:26] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:26:14] (PS1) JHathaway: puppet8: add db_user [labs/private] - https://gerrit.wikimedia.org/r/1064113 (https://phabricator.wikimedia.org/T372664)
[21:26:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:26:44] (CR) JHathaway: "check experimental" [labs/private] - https://gerrit.wikimedia.org/r/1064113 (https://phabricator.wikimedia.org/T372664) (owner: JHathaway)
[21:26:48] (CR) Dzahn: "all of these are "(NXDOMAIN)"" [puppet] - https://gerrit.wikimedia.org/r/1064091 (owner: Dzahn)
[21:28:03] (CR) JHathaway: "check experimental" [labs/private] - https://gerrit.wikimedia.org/r/1064113 (https://phabricator.wikimedia.org/T372664) (owner: JHathaway)
[21:30:26] (PS1) Cwhite: logstash: add more fields to label normalization filter [puppet] - https://gerrit.wikimedia.org/r/1064114
[21:34:49] ops-eqiad, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372939#10079461 (phaultfinder)
[21:36:18] (CR) EoghanGaffney: [C:+2] apt-staging: Add check for packages in protected branches [puppet] - https://gerrit.wikimedia.org/r/1063015 (owner: EoghanGaffney)
[21:36:28] (CR) EoghanGaffney: [C:+2] apt-staging: Change gitlab package puller to use paths instead of IDs [puppet] - https://gerrit.wikimedia.org/r/1064018 (owner: EoghanGaffney)
[21:45:43] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2024.codfw.wmnet with OS bullseye
[21:59:03] (PS1) Ahmon Dancy: mw-debug/mw-web: Reduce CPU requests/limits for train-dev [deployment-charts] - https://gerrit.wikimedia.org/r/1064124
[22:01:36] (CR) EoghanGaffney: [C:+1] "Looks good, one nit on indentation but otherwise sgtm" [puppet] - https://gerrit.wikimedia.org/r/1064109 (https://phabricator.wikimedia.org/T363564) (owner: Dzahn)
[22:02:15] !log dancy@deploy1003 Installing scap version "4.99.0" for 210 hosts
[22:02:59] ops-codfw, SRE, Data-Platform-SRE, DC-Ops: DegradedArray event on /dev/md/0:wdqs2024 - https://phabricator.wikimedia.org/T345542#10079534 (bking) Resolved→Open Hello @Jhancock.wm! Unfortunately, we are having trouble resurrecting this host. It's failed in the middle of a reimage 4 times n...
[22:11:42] (PS1) BCornwall: varnish: Remove unused browser security checks [puppet] - https://gerrit.wikimedia.org/r/1064125 (https://phabricator.wikimedia.org/T370200)
[22:13:47] (PS2) Ahmon Dancy: mw-debug/mw-web: Reduce CPU requests/limits for train-dev [deployment-charts] - https://gerrit.wikimedia.org/r/1064124
[22:18:51] (PS2) Andrea Denisse: alert: Add the alert[12]002 hosts to acme chief [puppet] - https://gerrit.wikimedia.org/r/1064107 (https://phabricator.wikimedia.org/T372418)
[22:19:26] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:20:52] (CR) BCornwall: [V:+1] "https://puppet-compiler.wmflabs.org/output/1064125/3702/cp1112.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1064125 (https://phabricator.wikimedia.org/T370200) (owner: BCornwall)
[22:21:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:23:48] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[22:27:04] (CR) Ahmon Dancy: [C:+1] "OK w/ me." [puppet] - https://gerrit.wikimedia.org/r/1064084 (https://phabricator.wikimedia.org/T372848) (owner: BryanDavis)
[22:28:47] (PS2) Dzahn: gitlab: add replica hosts to Icinga [puppet] - https://gerrit.wikimedia.org/r/1064109 (https://phabricator.wikimedia.org/T363564)
[22:31:22] (CR) EoghanGaffney: [C:+1] "lgtm, thanks for fixing the comments!" [puppet] - https://gerrit.wikimedia.org/r/1064109 (https://phabricator.wikimedia.org/T363564) (owner: Dzahn)
[22:33:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[22:41:26] paged, but I think it's a victorops replay from earlier rather than a fresh alert, checking
[22:42:48] no, it's fresh, I wonder why it didn't appear here
[22:42:56] db1206 replag
[22:44:32] !log rzl@cumin1002 dbctl commit (dc=all): 'db1206 depooled', diff saved to https://phabricator.wikimedia.org/P67402 and previous config saved to /var/cache/conftool/dbconfig/20240820-224431-rzl.json
[22:59:19] filed https://phabricator.wikimedia.org/T372961
[23:11:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[23:35:24] (CR) Scott French: [C:+1] "Halves nearly everything. Seems sensible to me!" [deployment-charts] - https://gerrit.wikimedia.org/r/1064124 (owner: Ahmon Dancy)
[23:38:40] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1064133
[23:38:40] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1064133 (owner: TrainBranchBot)