[00:02:02] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 06 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1058683 (https://phabricator.wikimedia.org/T371070) (owner: Jdlrobson)
[00:04:23] FIRING: [3x] SystemdUnitFailed: netbox_report_coherence_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:04:33] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1059951 (owner: TrainBranchBot)
[00:58:05] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[01:08:29] (PS1) TrainBranchBot: Branch commit for wmf/1.43.0-wmf.17 [core] (wmf/1.43.0-wmf.17) - https://gerrit.wikimedia.org/r/1059955 (https://phabricator.wikimedia.org/T366962)
[01:08:31] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.43.0-wmf.17 [core] (wmf/1.43.0-wmf.17) - https://gerrit.wikimedia.org/r/1059955 (https://phabricator.wikimedia.org/T366962) (owner: TrainBranchBot)
[01:22:46] FIRING: [36x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:24:47] (PS1) Scott French: deployment_server: mwscript_k8s uses report.json [puppet] - https://gerrit.wikimedia.org/r/1059956
[01:30:18] (CR) Scott French: "Thanks +cc
@adancy@wikimedia.org for pointing this out earlier today." [puppet] - https://gerrit.wikimedia.org/r/1059956 (owner: Scott French)
[01:32:44] (PS2) Scott French: deployment_server: mwscript_k8s uses report.json [puppet] - https://gerrit.wikimedia.org/r/1059956 (https://phabricator.wikimedia.org/T341553)
[01:35:54] (Merged) jenkins-bot: Branch commit for wmf/1.43.0-wmf.17 [core] (wmf/1.43.0-wmf.17) - https://gerrit.wikimedia.org/r/1059955 (https://phabricator.wikimedia.org/T366962) (owner: TrainBranchBot)
[02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240806T0200)
[02:17:42] (PS1) Andrew Bogott: dynamic proxy: Add an endpoint for scrubbing out nonexistent backends [puppet] - https://gerrit.wikimedia.org/r/1059958 (https://phabricator.wikimedia.org/T371707)
[02:17:43] (PS1) Andrew Bogott: wmf_sink: replace targetted proxy cleanup with project-wide cleanup [puppet] - https://gerrit.wikimedia.org/r/1059959 (https://phabricator.wikimedia.org/T371707)
[02:18:08] (Abandoned) Andrew Bogott: wmf_sink: rip out the proxy-cleanup code [puppet] - https://gerrit.wikimedia.org/r/1059409 (https://phabricator.wikimedia.org/T371707) (owner: Andrew Bogott)
[02:25:09] (PS2) Andrew Bogott: dynamic proxy: Add an endpoint for scrubbing out nonexistent backends [puppet] - https://gerrit.wikimedia.org/r/1059958 (https://phabricator.wikimedia.org/T371707)
[02:25:09] (PS2) Andrew Bogott: wmf_sink: replace targetted proxy cleanup with project-wide cleanup [puppet] - https://gerrit.wikimedia.org/r/1059959 (https://phabricator.wikimedia.org/T371707)
[02:39:23] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets -
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240806T0300)
[03:00:41] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:01:30] (PS1) TrainBranchBot: testwikis to 1.43.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/1059962 (https://phabricator.wikimedia.org/T366962)
[03:01:32] (CR) TrainBranchBot: [C:+2] testwikis to 1.43.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/1059962 (https://phabricator.wikimedia.org/T366962) (owner: TrainBranchBot)
[03:02:10] (Merged) jenkins-bot: testwikis to 1.43.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/1059962 (https://phabricator.wikimedia.org/T366962) (owner: TrainBranchBot)
[03:02:28] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.43.0-wmf.17 refs T366962
[03:02:31] T366962: 1.43.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T366962
[03:47:33] !log mwpresync@deploy1003 Finished scap: testwikis to 1.43.0-wmf.17 refs T366962 (duration: 45m 05s)
[03:47:40] T366962: 1.43.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T366962
[04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240806T0400)
[04:01:04] !log mwpresync@deploy1003 Pruned MediaWiki: 1.43.0-wmf.14 (duration: 00m 58s)
[04:04:23] FIRING: [3x] SystemdUnitFailed: netbox_report_coherence_run.service on netbox1002:9100
- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:31:26] (PS1) Ryan Kemper: wdqs: add wdqs1021 to scap targets [puppet] - https://gerrit.wikimedia.org/r/1059963 (https://phabricator.wikimedia.org/T370754)
[04:31:44] (CR) Ryan Kemper: [V:+2 C:+2] wdqs: add wdqs1021 to scap targets [puppet] - https://gerrit.wikimedia.org/r/1059963 (https://phabricator.wikimedia.org/T370754) (owner: Ryan Kemper)
[04:36:09] !log ryankemper@deploy1003 Started deploy [wdqs/wdqs@316bf7f]: deploy to freshly reimaged host
[04:36:19] !log ryankemper@deploy1003 Finished deploy [wdqs/wdqs@316bf7f]: deploy to freshly reimaged host (duration: 00m 09s)
[04:36:50] !log ryankemper@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1023.eqiad.wmnet with OS bullseye
[04:37:03] !log ryankemper@deploy1003 Started deploy [wdqs/wdqs@316bf7f]: deploy to freshly reimaged host
[04:37:12] !log ryankemper@deploy1003 Finished deploy [wdqs/wdqs@316bf7f]: deploy to freshly reimaged host (duration: 00m 09s)
[04:38:37] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1021.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20240729/ using stat1009.eqiad.wmnet)
[04:39:32] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20240729/ using stat1009.eqiad.wmnet)
[04:58:05] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet -
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:22:46] FIRING: [36x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:41:44] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20240729/ using stat1009.eqiad.wmnet)
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240806T0600)
[06:00:05] marostegui, Amir1, and arnaudb: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240806T0600).
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:04:25] marostegui: if no db switch happening then, I would like to deploy mint/cxserver.
[06:04:39] (In 15-20 minutes is fine)
[06:04:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:24:23] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:43:22] (CR) Slyngshede: [C:+2] admin: add wmdecyn to analytics-privatedata-users, no shell [puppet] - https://gerrit.wikimedia.org/r/1059371 (https://phabricator.wikimedia.org/T371689) (owner: Fabfur)
[06:43:28] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to analytics-privatedata-users for Cynthia Makonyango WMDE - https://phabricator.wikimedia.org/T371689#10044069 (SLyngshede-WMF)
[06:43:28] Going ahead with MinT deployment first.
[06:43:34] (CR) KartikMistry: [C:+2] Update MinT to 2024-08-05-062247-production [deployment-charts] - https://gerrit.wikimedia.org/r/1057421 (https://phabricator.wikimedia.org/T363308) (owner: KartikMistry)
[06:44:22] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to analytics-privatedata-users for Cynthia Makonyango WMDE - https://phabricator.wikimedia.org/T371689#10044067 (SLyngshede-WMF) p:Triage→Medium
[06:44:29] (Merged) jenkins-bot: Update MinT to 2024-08-05-062247-production [deployment-charts] - https://gerrit.wikimedia.org/r/1057421 (https://phabricator.wikimedia.org/T363308) (owner: KartikMistry)
[06:45:24] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to analytics-privatedata-users for Joely Rooke WMDE - https://phabricator.wikimedia.org/T371584#10044074 (SLyngshede-WMF)
[06:46:06] (CR) Slyngshede: [C:+2] admin: add Joely Rooke (WMDE) to analytics-privatedata, no shell acccess [puppet] - https://gerrit.wikimedia.org/r/1059944 (https://phabricator.wikimedia.org/T371584) (owner: Dzahn)
[06:46:07] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to analytics-privatedata-users for Joely Rooke WMDE - https://phabricator.wikimedia.org/T371584#10044075 (SLyngshede-WMF)
[06:46:13] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to analytics-privatedata-users for Joely Rooke WMDE - https://phabricator.wikimedia.org/T371584#10044076 (SLyngshede-WMF)
[06:47:47] (PS2) Filippo Giunchedi: sre.hosts.reimage: skip asking for puppet version past bullseye [cookbooks] - https://gerrit.wikimedia.org/r/1059903
[06:47:57] (CR) Filippo Giunchedi: sre.hosts.reimage: skip asking for puppet version past bullseye (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1059903 (owner: Filippo Giunchedi)
[06:49:51] (PS1) Stevemunene: Upgrade airflow test instance
version to v2.9.3 [puppet] - https://gerrit.wikimedia.org/r/1059969 (https://phabricator.wikimedia.org/T365449)
[06:50:13] (PS2) Slyngshede: admin: add Joely Rooke (WMDE) to analytics-privatedata, no shell acccess [puppet] - https://gerrit.wikimedia.org/r/1059944 (https://phabricator.wikimedia.org/T371584) (owner: Dzahn)
[06:50:32] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 15169
[06:50:37] (Abandoned) Stevemunene: Upgrade airflow test instance version to v2.9.2 [puppet] - https://gerrit.wikimedia.org/r/1054329 (https://phabricator.wikimedia.org/T365449) (owner: Stevemunene)
[06:51:25] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to analytics-privatedata-users for Cynthia Makonyango WMDE - https://phabricator.wikimedia.org/T371689#10044097 (SLyngshede-WMF) Open→Resolved
[06:51:34] ah. Wrong docker tag. Fixing.
[06:51:47] (PS1) KartikMistry: Update MinT to 2024-08-05-062247-production [deployment-charts] - https://gerrit.wikimedia.org/r/1059970
[06:53:28] (CR) Slyngshede: [C:+2] admin: add Joely Rooke (WMDE) to analytics-privatedata, no shell acccess [puppet] - https://gerrit.wikimedia.org/r/1059944 (https://phabricator.wikimedia.org/T371584) (owner: Dzahn)
[06:55:09] SRE, SRE-Access-Requests: Requesting access to for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T371796#10044103 (SLyngshede-WMF)
[06:56:12] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to analytics-privatedata-users for Joely Rooke WMDE - https://phabricator.wikimedia.org/T371584#10044100 (SLyngshede-WMF) Open→Resolved p:Triage→Medium
[06:56:44] (CR) KartikMistry: [C:+2] Update MinT to 2024-08-05-062247-production [deployment-charts] - https://gerrit.wikimedia.org/r/1059970 (owner: KartikMistry)
[06:57:37] (Merged) jenkins-bot: Update MinT to 2024-08-05-062247-production [deployment-charts]
- https://gerrit.wikimedia.org/r/1059970 (owner: KartikMistry)
[06:58:33] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[07:00:05] Amir1 and Urbanecm: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240806T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:03:24] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[07:04:23] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:05:56] !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[07:06:23] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 15169
[07:07:44] sre-alert-triage, SRE Observability (FY2024/2025-Q1): Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T354255#10044130 (LSobanski) There is now one other similar alert that is over a month old: Linting problems found for MediawikiPageContentChang...
[07:07:46] SRE, LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for Seanleong-WMDE - https://phabricator.wikimedia.org/T371694#10044131 (seanleong-WMDE) Hi @KFrancis, I've sent the details to your email. Thanks!
[07:09:37] (CR) Filippo Giunchedi: [C:-1] "See inline, still small improvements to do and then we're good to go" [puppet] - https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) (owner: AOkoth)
[07:13:49] !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[07:13:55] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 61642
[07:14:13] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 61642
[07:14:31] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 264014
[07:14:47] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 264014
[07:14:56] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 265158
[07:15:10] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 265158
[07:15:13] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 61928
[07:15:32] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 61928
[07:15:35] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 262725
[07:16:06] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 262725
[07:16:09] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 142108
[07:16:21] (CR) Filippo Giunchedi: [C:+1] Postgres prom exporter: ignore queries.yaml on >= bookworm [puppet] - https://gerrit.wikimedia.org/r/1059834 (owner: Ayounsi)
[07:18:57] (CR) Ayounsi: [V:+1 C:+2] Postgres prom exporter: ignore queries.yaml on >= bookworm [puppet] - https://gerrit.wikimedia.org/r/1059834 (owner: Ayounsi)
[07:19:23] FIRING: [4x] SystemdUnitFailed:
generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:28:04] !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[07:30:23] (CR) Elukey: [C:+1] sre.hosts.reimage: skip asking for puppet version past bullseye [cookbooks] - https://gerrit.wikimedia.org/r/1059903 (owner: Filippo Giunchedi)
[07:32:13] (CR) Filippo Giunchedi: [C:+2] sre.hosts.reimage: skip asking for puppet version past bullseye [cookbooks] - https://gerrit.wikimedia.org/r/1059903 (owner: Filippo Giunchedi)
[07:34:01] !log powercycle ml-serve2001 - host seems frozen, DIMM errors registered in `getsel`
[07:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:43] (PS2) KartikMistry: Update cxserver to 2024-08-05-063332-production [deployment-charts] - https://gerrit.wikimedia.org/r/1059746 (https://phabricator.wikimedia.org/T371760)
[07:37:44] !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[07:39:03] (CR) KartikMistry: [C:+2] Update cxserver to 2024-08-05-063332-production [deployment-charts] - https://gerrit.wikimedia.org/r/1059746 (https://phabricator.wikimedia.org/T371760) (owner: KartikMistry)
[07:39:57] !log Updated MinT to 2024-08-05-062247-production (T363308, T355304, T368521)
[07:40:00] (Merged) jenkins-bot: Update cxserver to 2024-08-05-063332-production [deployment-charts] - https://gerrit.wikimedia.org/r/1059746 (https://phabricator.wikimedia.org/T371760) (owner: KartikMistry)
[07:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:02] T363308: Server-side caching for MinT - https://phabricator.wikimedia.org/T363308
[07:40:02] T355304: Enable Softcatalà models for more language pairs in MinT test instance -
https://phabricator.wikimedia.org/T355304
[07:40:03] T368521: MinT leaks template metadata in translated content - https://phabricator.wikimedia.org/T368521
[07:42:20] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply
[07:42:40] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[07:42:50] RESOLVED: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:43:08] (CR) Elukey: [V:+1 C:+2] puppet: unset GIT_INDEX_FILE env var in post-commit hooks [puppet] - https://gerrit.wikimedia.org/r/1059899 (https://phabricator.wikimedia.org/T368023) (owner: Elukey)
[07:43:52] !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply
[07:44:12] <_joe_> !log uploaded conftool 3.2.1 to apt.wikimedia.org
[07:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:25] !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[07:45:34] !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[07:45:44] <_joe_> I am going to manually upgrade conftool on one host
[07:46:08] !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[07:49:04] !log oblivian@puppetserver1002 conftool action : set/weight=1; selector: cluster=videoscaler,name=mw1407.eqiad.wmnet
[07:49:10] !log oblivian@puppetserver1002 conftool action : set/weight=10; selector: cluster=videoscaler,name=mw1407.eqiad.wmnet
[07:50:28] !log Updated cxserver to 2024-08-05-063332-production (T371760, T357950)
[07:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:32] T371760: Post-creation work for bdrwiki -
https://phabricator.wikimedia.org/T371760
[07:50:33] T357950: Remove servicerunner dependency for cxserver - https://phabricator.wikimedia.org/T357950
[07:51:40] (PS1) DCausse: wdqs: set proper kafka topics for main and scholarly [puppet] - https://gerrit.wikimedia.org/r/1060049 (https://phabricator.wikimedia.org/T364366)
[07:58:24] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#10044227 (elukey) Fix for the `dump-cloud-ip-ranges` timer/unit rolled out, I also tried to do a manual puppet private commi...
[08:00:05] jnuche and brennen: Your horoscope predicts another MediaWiki train - Utc-0+Utc-7 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240806T0800).
[08:00:17] morning, I'll start the train in a few minutes
[08:03:07] (PS1) TrainBranchBot: group0 to 1.43.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/1060060 (https://phabricator.wikimedia.org/T366962)
[08:03:09] (CR) TrainBranchBot: [C:+2] group0 to 1.43.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/1060060 (https://phabricator.wikimedia.org/T366962) (owner: TrainBranchBot)
[08:03:47] (Merged) jenkins-bot: group0 to 1.43.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/1060060 (https://phabricator.wikimedia.org/T366962) (owner: TrainBranchBot)
[08:12:07] !log elukey@puppetserver1001 conftool action : set/pooled=no; selector: name=wdqs1023.eqiad.wmnet
[08:13:32] (CR) Vgutierrez: varnish: Add restrictive CSP to upload.wikimedia.org and add tests (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) (owner: CDobbins)
[08:13:42] (PS1) Filippo Giunchedi: data-engineering: fix MediawikiPageContentChangeEnrichAvailability matching [alerts] - https://gerrit.wikimedia.org/r/1060061
[08:16:23] !log powercycle wdqs1023, misbehaving and not responding to ssh anymore
[08:16:25] !log jnuche@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.43.0-wmf.17 refs T366962
[08:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:37] T366962: 1.43.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T366962
[08:16:57] ops-codfw, DC-Ops, Machine-Learning-Team: hw troubleshooting: ml-serve2001.codfw.wmnet: continued uncorrectable ECC errors - https://phabricator.wikimedia.org/T371872 (klausman) NEW
[08:19:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[08:25:06] (PS1) Ayounsi: Netbox prometheus: replace exporter script with plugin [puppet] - https://gerrit.wikimedia.org/r/1060064 (https://phabricator.wikimedia.org/T311052)
[08:25:30] (CR) CI reject: [V:-1] Netbox prometheus: replace exporter script with plugin [puppet] - https://gerrit.wikimedia.org/r/1060064 (https://phabricator.wikimedia.org/T311052) (owner: Ayounsi)
[08:25:48] (PS2) Ayounsi: Netbox prometheus: replace exporter script with plugin [puppet] - https://gerrit.wikimedia.org/r/1060064 (https://phabricator.wikimedia.org/T311052)
[08:26:13] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[08:26:24] (PS3) Ayounsi: Netbox prometheus: replace exporter script with plugin [puppet] - https://gerrit.wikimedia.org/r/1060064 (https://phabricator.wikimedia.org/T311052)
[08:26:25] (CR) CI reject: [V:-1] Netbox prometheus: replace exporter script with plugin [puppet] -
https://gerrit.wikimedia.org/r/1060064 (https://phabricator.wikimedia.org/T311052) (owner: Ayounsi)
[08:26:46] (CR) CI reject: [V:-1] Netbox prometheus: replace exporter script with plugin [puppet] - https://gerrit.wikimedia.org/r/1060064 (https://phabricator.wikimedia.org/T311052) (owner: Ayounsi)
[08:27:29] (PS4) Ayounsi: Netbox prometheus: replace exporter script with plugin [puppet] - https://gerrit.wikimedia.org/r/1060064 (https://phabricator.wikimedia.org/T311052)
[08:30:34] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1060064 (https://phabricator.wikimedia.org/T311052) (owner: Ayounsi)
[08:31:14] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[08:31:52] (PS1) DCausse: wdqs: fix monitoring ua for main and scholarly [puppet] - https://gerrit.wikimedia.org/r/1060065
[08:34:12] (PS1) Filippo Giunchedi: pontoon: fix reboot hosts fqdn vs hostname [puppet] - https://gerrit.wikimedia.org/r/1060068
[08:34:12] (PS1) Filippo Giunchedi: webperf: use fully qualified kafka cluster names [puppet] - https://gerrit.wikimedia.org/r/1060069
[08:34:12] (PS1) Filippo Giunchedi: benthos: use fully qualified kafka cluster name [puppet] - https://gerrit.wikimedia.org/r/1060070
[08:34:13] (PS1) Filippo Giunchedi: pontoon: restore Benthos instances functionality [puppet] - https://gerrit.wikimedia.org/r/1060071
[08:34:35] (CR) CI reject: [V:-1] pontoon: fix reboot hosts fqdn vs hostname [puppet] - https://gerrit.wikimedia.org/r/1060068 (owner: Filippo Giunchedi)
[08:34:41] (CR) CI reject: [V:-1] webperf: use fully qualified kafka cluster names [puppet] - https://gerrit.wikimedia.org/r/1060069 (owner: Filippo Giunchedi)
[08:34:49] (CR) CI reject: [V:-1] benthos:
use fully qualified kafka cluster name [puppet] - https://gerrit.wikimedia.org/r/1060070 (owner: Filippo Giunchedi)
[08:36:11] (CR) Ayounsi: [V:+1] "PCC happy." [puppet] - https://gerrit.wikimedia.org/r/1060064 (https://phabricator.wikimedia.org/T311052) (owner: Ayounsi)
[08:36:17] (PS2) Filippo Giunchedi: pontoon: fix reboot hosts fqdn vs hostname [puppet] - https://gerrit.wikimedia.org/r/1060068
[08:36:17] (PS2) Filippo Giunchedi: webperf: use fully qualified kafka cluster names [puppet] - https://gerrit.wikimedia.org/r/1060069
[08:36:17] (PS2) Filippo Giunchedi: benthos: use fully qualified kafka cluster name [puppet] - https://gerrit.wikimedia.org/r/1060070
[08:36:17] (PS2) Filippo Giunchedi: pontoon: restore Benthos instances functionality [puppet] - https://gerrit.wikimedia.org/r/1060071
[08:36:45] (CR) CI reject: [V:-1] webperf: use fully qualified kafka cluster names [puppet] - https://gerrit.wikimedia.org/r/1060069 (owner: Filippo Giunchedi)
[08:36:53] (CR) CI reject: [V:-1] benthos: use fully qualified kafka cluster name [puppet] - https://gerrit.wikimedia.org/r/1060070 (owner: Filippo Giunchedi)
[08:39:02] (PS3) Filippo Giunchedi: webperf: use fully qualified kafka cluster names [puppet] - https://gerrit.wikimedia.org/r/1060069
[08:39:02] (PS3) Filippo Giunchedi: benthos: use fully qualified kafka cluster name [puppet] - https://gerrit.wikimedia.org/r/1060070
[08:39:02] (PS3) Filippo Giunchedi: pontoon: restore Benthos instances functionality [puppet] - https://gerrit.wikimedia.org/r/1060071
[08:39:56] (PS2) DCausse: wdqs: fix monitoring ua for internal, main and scholarly [puppet] - https://gerrit.wikimedia.org/r/1060065
[08:39:56] (PS2) DCausse: wdqs: set proper kafka topics for main and scholarly [puppet] - https://gerrit.wikimedia.org/r/1060049 (https://phabricator.wikimedia.org/T364366)
[08:42:28] !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart
for nodes matching A:cassandra-dev: Openjdk upgrade - elukey@cumin1002
[08:43:11] (CR) Filippo Giunchedi: [C:+2] pontoon: fix reboot hosts fqdn vs hostname [puppet] - https://gerrit.wikimedia.org/r/1060068 (owner: Filippo Giunchedi)
[08:43:12] !log shutting cloudsw1-d5-eqiad <-> cloudsw1-e4-eqiad link
[08:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:01] (PS5) Ayounsi: Netbox prometheus: replace exporter script with plugin [puppet] - https://gerrit.wikimedia.org/r/1060064 (https://phabricator.wikimedia.org/T311052)
[08:53:31] ops-eqiad, SRE, DC-Ops, serviceops: hw troubleshooting: CPU 2 machine check error detected for rdb1014.eqiad.wmnet - https://phabricator.wikimedia.org/T370633#10044349 (jijiki) Open→Resolved @Jclark-ctr Thank you! Closing for now and will reopen if the problem persists
[08:54:13] (CR) Ayounsi: "It also seems safe to use a regex instead of specifying each label to delete as :" [puppet] - https://gerrit.wikimedia.org/r/1060064 (https://phabricator.wikimedia.org/T311052) (owner: Ayounsi)
[09:02:25] !log elukey@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:cassandra-dev: Openjdk upgrade - elukey@cumin1002
[09:07:15] !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: Openjdk upgrade - elukey@cumin1002
[09:10:28] (PS1) Ayounsi: Netbox script proxy: set to absent where possible [puppet] - https://gerrit.wikimedia.org/r/1060074 (https://phabricator.wikimedia.org/T311052)
[09:10:29] (PS1) Ayounsi: Remove profile::netbox::scripts from Puppet [puppet] - https://gerrit.wikimedia.org/r/1060075 (https://phabricator.wikimedia.org/T311052)
[09:10:53] (CR) CI reject: [V:-1] Netbox script proxy: set to absent where possible [puppet] - https://gerrit.wikimedia.org/r/1060074 (https://phabricator.wikimedia.org/T311052) (owner: Ayounsi)
[09:11:03] (PS1) DCausse:
cirrus-streaming-updater: bump image to v20240806085845-f838190 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060076 (https://phabricator.wikimedia.org/T328330) [09:12:04] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1060075 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [09:14:33] dcausse: was your change at https://gerrit.wikimedia.org/r/1060065 what resolved the wdqs lag? We got a ticket filed about the lag (https://phabricator.wikimedia.org/T371871) and I wanted to add some more details about the cause, but I can't quite parse the cause from the commit message / changes. [09:15:23] codders: thanks for the pointer! will comment on the ticket to explain what happened [09:15:33] nice - thanks! [09:15:41] (03PS1) 10Ayounsi: Remove custom_script_proxy.py and getstats.py [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1060079 (https://phabricator.wikimedia.org/T311052) [09:19:43] (03PS2) 10Ayounsi: Netbox script proxy: set to absent where possible [puppet] - 10https://gerrit.wikimedia.org/r/1060074 (https://phabricator.wikimedia.org/T311052) [09:19:43] (03PS2) 10Ayounsi: Remove profile::netbox::scripts from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/1060075 (https://phabricator.wikimedia.org/T311052) [09:20:01] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1060074 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [09:20:07] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1060075 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [09:22:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T367856)', diff saved to https://phabricator.wikimedia.org/P67227 and previous config saved to /var/cache/conftool/dbconfig/20240806-092212-marostegui.json [09:22:16] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [09:22:46] FIRING: 
[36x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:24:57] !log elukey@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: Openjdk upgrade - elukey@cumin1002 [09:26:46] (03CR) 10Stevemunene: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1060065 (owner: 10DCausse) [09:27:26] (03CR) 10Btullis: "Cool. Just wondering why you prefer to do it here, rather than in the helmfile?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059887 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [09:29:15] (03CR) 10Brouberol: "This is personal preference really. I'd rather the helmfile does not contain much (and TBH, any) custom YAML resource. Instead, have them " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059887 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [09:29:51] (03CR) 10Btullis: [C:03+1] cloudnative-pg: define network policies allowing traffic to and from the k8s API server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059887 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [09:30:11] (03PS3) 10Ayounsi: Netbox script proxy: set to absent where possible [puppet] - 10https://gerrit.wikimedia.org/r/1060074 (https://phabricator.wikimedia.org/T311052) [09:30:12] (03PS3) 10Ayounsi: Remove profile::netbox::scripts from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/1060075 (https://phabricator.wikimedia.org/T311052) [09:31:19] (03CR) 10Btullis: [C:03+1] "Yes, that's fully understandable. 
I was only thinking of trying to minimise diversion from the upstream chart, in case we want to track th" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059887 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [09:33:19] (03CR) 10CI reject: [V:04-1] Netbox script proxy: set to absent where possible [puppet] - 10https://gerrit.wikimedia.org/r/1060074 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [09:33:32] (03CR) 10CI reject: [V:04-1] Remove profile::netbox::scripts from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/1060075 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [09:35:22] (03CR) 10Stevemunene: [C:03+1] wdqs: set proper kafka topics for main and scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1060049 (https://phabricator.wikimedia.org/T364366) (owner: 10DCausse) [09:36:04] (03CR) 10Ayounsi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1060074 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [09:37:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P67228 and previous config saved to /var/cache/conftool/dbconfig/20240806-093719-marostegui.json [09:39:55] marostegui: I'm about to update confctl [09:40:03] and dbctl [09:40:11] FYI [09:40:30] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1060074 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [09:40:49] (03CR) 10Ayounsi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1060075 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [09:40:51] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1060075 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [09:41:03] (03CR) 10Ayounsi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1060074 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [09:41:12] (03PS1) 10Jelto: 
etherpad: make defaultPadText more explicit about personal use [puppet] - 10https://gerrit.wikimedia.org/r/1060082 (https://phabricator.wikimedia.org/T371591) [09:41:27] !log upgrading conftool to 3.2.1 everywhere T369606 [09:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:30] T369606: Allow integrating requestctl rules into haproxy - https://phabricator.wikimedia.org/T369606 [09:42:42] (03CR) 10Giuseppe Lavagetto: [C:03+2] haproxy: add confd files for ipblock maps [puppet] - 10https://gerrit.wikimedia.org/r/1059457 (https://phabricator.wikimedia.org/T370745) (owner: 10Giuseppe Lavagetto) [09:43:03] (03PS3) 10Giuseppe Lavagetto: cache: test requestctl rules in haproxy on cp4044 [puppet] - 10https://gerrit.wikimedia.org/r/1059459 (https://phabricator.wikimedia.org/T370745) [09:43:42] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3558/co" [puppet] - 10https://gerrit.wikimedia.org/r/1060082 (https://phabricator.wikimedia.org/T371591) (owner: 10Jelto) [09:44:02] (03CR) 10Giuseppe Lavagetto: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1059459 (https://phabricator.wikimedia.org/T370745) (owner: 10Giuseppe Lavagetto) [09:47:39] (03PS4) 10Giuseppe Lavagetto: cache: test requestctl rules in haproxy on cp4044 [puppet] - 10https://gerrit.wikimedia.org/r/1059459 (https://phabricator.wikimedia.org/T370745) [09:49:53] (03CR) 10Giuseppe Lavagetto: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1059459 (https://phabricator.wikimedia.org/T370745) (owner: 10Giuseppe Lavagetto) [09:50:37] (03CR) 10Jelto: [C:04-1] "Thanks for the clarification, that makes the change clearer. However, I'm still slightly against patching the Etherpad package. 
The build " [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/1059036 (https://phabricator.wikimedia.org/T371591) (owner: 10Aklapper) [09:52:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P67229 and previous config saved to /var/cache/conftool/dbconfig/20240806-095226-marostegui.json [09:53:52] jouncebot: nowandnext [09:53:52] For the next 0 hour(s) and 6 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240806T0800) [09:53:52] In 0 hour(s) and 6 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240806T1000) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240806T1000) [10:00:08] (03PS1) 10Abijeet Patro: TranslatablePage: Use local cache to reduce calls to the WAN cache [extensions/Translate] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1060085 (https://phabricator.wikimedia.org/T366455) [10:00:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/Translate] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1060085 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [10:02:34] (03PS1) 10Abijeet Patro: TranslatablePage: Use local cache to reduce calls to the WAN cache [extensions/Translate] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1060086 (https://phabricator.wikimedia.org/T366455) [10:03:30] 06SRE, 06Infrastructure-Foundations, 10netops: cloudsw1-d5-eqiad instability Aug 6 2024 - https://phabricator.wikimedia.org/T371879 (10cmooney) 03NEW p:05Triage→03High [10:04:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 06 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/Translate] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1060086 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [10:07:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T367856)', diff saved to https://phabricator.wikimedia.org/P67231 and previous config saved to /var/cache/conftool/dbconfig/20240806-100734-marostegui.json [10:07:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db1218.eqiad.wmnet with reason: Maintenance [10:07:37] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [10:07:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db1218.eqiad.wmnet with reason: Maintenance [10:07:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1218 (T367856)', diff saved to https://phabricator.wikimedia.org/P67232 and previous config saved to /var/cache/conftool/dbconfig/20240806-100756-marostegui.json [10:09:53] (03PS1) 10David Caro: common: add dcaro user for access to cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/1060087 [10:10:45] (03CR) 10David Caro: [C:04-1] common: add dcaro user for access to cloudsw (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1060087 (owner: 10David Caro) [10:11:03] (03PS2) 10David Caro: common: add dcaro user for access to cloudsw [homer/public] - 10https://gerrit.wikimedia.org/r/1060087 [10:12:26] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/1060087 (owner: 10David Caro) [10:20:22] 06SRE, 06Infrastructure-Foundations, 10netops: cloudsw1-d5-eqiad instability Aug 6 2024 - https://phabricator.wikimedia.org/T371879#10044547 (10cmooney) [10:27:51] (03CR) 10CI reject: [V:04-1] TranslatablePage: Use local cache to reduce calls to the WAN cache [extensions/Translate] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1060085 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [10:28:50] (03Abandoned) 10Aklapper: Make Etherpad frontpage say it's not for personal use [debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/1059036 (https://phabricator.wikimedia.org/T371591) (owner: 10Aklapper) [10:29:24] (03CR) 10Ayounsi: [C:03+1] "+1 on the principle, with 2 comments." [homer/public] - 10https://gerrit.wikimedia.org/r/1060087 (owner: 10David Caro) [10:29:30] (03CR) 10Aklapper: [C:03+1] etherpad: make defaultPadText more explicit about personal use [puppet] - 10https://gerrit.wikimedia.org/r/1060082 (https://phabricator.wikimedia.org/T371591) (owner: 10Jelto) [10:32:45] (03PS4) 10Brouberol: cloudnative-pg: define network policies allowing traffic to and from the k8s API server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059887 (https://phabricator.wikimedia.org/T364797) [10:33:05] (03PS5) 10Brouberol: cloudnative-pg: define network policies allowing traffic to and from the k8s API server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059887 (https://phabricator.wikimedia.org/T364797) [10:33:57] (03CR) 10CI reject: [V:04-1] cloudnative-pg: define network policies allowing traffic to and from the k8s API server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059887 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [10:38:28] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1060049 (https://phabricator.wikimedia.org/T364366) (owner: 10DCausse) [10:39:03] (03CR) 
10Stevemunene: [C:03+2] wdqs: fix monitoring ua for internal, main and scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1060065 (owner: 10DCausse) [10:40:02] (03PS6) 10Brouberol: cloudnative-pg: define network policies allowing traffic to and from the k8s API server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059887 (https://phabricator.wikimedia.org/T364797) [10:40:41] (03CR) 10CI reject: [V:04-1] cloudnative-pg: define network policies allowing traffic to and from the k8s API server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059887 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [10:46:28] (03CR) 10David Caro: common: add dcaro user for access to cloudsw (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/1060087 (owner: 10David Caro) [10:47:02] (03CR) 10David Caro: common: add dcaro user for access to cloudsw (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1060087 (owner: 10David Caro) [10:57:35] (03PS3) 10Hnowlan: group0, frwiki, itwiki: enable shellbox-video [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050378 (https://phabricator.wikimedia.org/T356241) [10:58:25] (03PS4) 10Hnowlan: group0, frwiki, itwiki: enable shellbox-video [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050378 (https://phabricator.wikimedia.org/T356241) [11:03:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050378 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [11:03:39] (03PS5) 10Giuseppe Lavagetto: cache: test requestctl rules in haproxy on cp4044 [puppet] - 10https://gerrit.wikimedia.org/r/1059459 (https://phabricator.wikimedia.org/T370745) [11:04:23] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:05:33] (03PS6) 10Giuseppe Lavagetto: cache: test requestctl rules in haproxy on cp4044 [puppet] - 10https://gerrit.wikimedia.org/r/1059459 (https://phabricator.wikimedia.org/T370745) [11:08:32] (03PS7) 10Giuseppe Lavagetto: cache: test requestctl rules in haproxy on cp4044 [puppet] - 10https://gerrit.wikimedia.org/r/1059459 (https://phabricator.wikimedia.org/T370745) [11:14:10] (03PS8) 10Giuseppe Lavagetto: cache: test requestctl rules in haproxy on cp4044 [puppet] - 10https://gerrit.wikimedia.org/r/1059459 (https://phabricator.wikimedia.org/T370745) [11:15:03] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3562/co" [puppet] - 10https://gerrit.wikimedia.org/r/1059459 (https://phabricator.wikimedia.org/T370745) (owner: 10Giuseppe Lavagetto) [11:19:23] FIRING: [3x] SystemdUnitFailed: netbox_report_coherence_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:21:30] (03PS1) 10Btullis: Add the postgresql prometheus exporter to an-db100[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/1060091 (https://phabricator.wikimedia.org/T371877) [11:22:36] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3563/co" [puppet] - 10https://gerrit.wikimedia.org/r/1060091 (https://phabricator.wikimedia.org/T371877) (owner: 10Btullis) [11:25:35] (03PS1) 10Slyngshede: Wikimedia: New management command for blocking users in systems. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1060092 (https://phabricator.wikimedia.org/T359820) [11:29:49] (03PS9) 10Effie Mouzeli: (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) [11:31:19] (03CR) 10CI reject: [V:04-1] (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) (owner: 10Effie Mouzeli) [11:32:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:32:18] (03CR) 10Stevemunene: [C:03+1] Add the postgresql prometheus exporter to an-db100[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/1060091 (https://phabricator.wikimedia.org/T371877) (owner: 10Btullis) [11:33:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057824 (https://phabricator.wikimedia.org/T371060) (owner: 10Anzx) [11:33:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057193 (https://phabricator.wikimedia.org/T371076) (owner: 10Anzx) [11:35:01] (03PS3) 10Anzx: mywikisource: add portal, author and translation namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057824 (https://phabricator.wikimedia.org/T371060) [11:35:12] (03PS3) 10Anzx: dtpwiki: add timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057193 
(https://phabricator.wikimedia.org/T371076) [11:37:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:39:07] (03CR) 10Btullis: [V:03+1 C:03+2] Add the postgresql prometheus exporter to an-db100[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/1060091 (https://phabricator.wikimedia.org/T371877) (owner: 10Btullis) [11:39:56] (03PS10) 10Effie Mouzeli: (DNM WIP) wikitech: de-wikitech mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059339 (https://phabricator.wikimedia.org/T371537) [11:42:26] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:43:41] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240806T1200) [12:09:41] (03PS1) 10Elukey: Move debmonitor discovery record to debmonitor2003 [dns] - 10https://gerrit.wikimedia.org/r/1060094 (https://phabricator.wikimedia.org/T368744) [12:11:16] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on debmonitor1003.eqiad.wmnet with reason: failover test [12:11:29] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 
on debmonitor1003.eqiad.wmnet with reason: failover test [12:13:50] !log stop debmonitor-server on debmonitor1003 as temporary test [12:13:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:22] !log restart debmonitor-server on debmonitor1003 [12:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:58] !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-codfw: Openjdk upgrade - elukey@cumin1002 [12:24:12] (03PS8) 10Brouberol: cloudnative-pg: define network policies allowing traffic to and from the k8s API server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059887 (https://phabricator.wikimedia.org/T364797) [12:26:34] (03CR) 10Brouberol: [C:03+1] Upgrade airflow test instance version to v2.9.3 [puppet] - 10https://gerrit.wikimedia.org/r/1059969 (https://phabricator.wikimedia.org/T365449) (owner: 10Stevemunene) [12:27:29] (03PS9) 10Brouberol: cloudnative-pg: define network policies allowing traffic to and from the k8s API server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059887 (https://phabricator.wikimedia.org/T364797) [12:31:06] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on debmonitor2003.codfw.wmnet with reason: failover test [12:31:19] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on debmonitor2003.codfw.wmnet with reason: failover test [12:32:48] !log apt-get purge debmonitor-server + run-puppet-agent to re-install the daemon on debmonitor2003 [12:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:23] (03CR) 10Btullis: [C:03+1] "Nice." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1059887 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [12:38:10] (03CR) 10Brouberol: [C:03+2] cloudnative-pg: define network policies allowing traffic to and from the k8s API server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059887 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [12:39:40] !log elukey@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-codfw: Openjdk upgrade - elukey@cumin1002 [12:39:43] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:39:56] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:41:05] (03CR) 10Abijeet Patro: "recheck" [extensions/Translate] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1060085 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [12:41:51] hello o/ I have a couple of patches for backport for the Translate extension. The CI takes quite a while to merge the patches in. 
Patches: 1060085: TranslatablePage: Use local cache to reduce calls to the WAN cache | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/1060085 [12:46:53] (03PS18) 10Effie Mouzeli: cronjobs : update modules to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) [12:48:03] (03CR) 10CI reject: [V:04-1] cronjobs : update modules to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli) [12:53:01] (03PS1) 10Brouberol: cloudnative-pg: set image tag proven to work [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060097 (https://phabricator.wikimedia.org/T364797) [12:54:27] (03CR) 10Zabe: [C:03+2] TranslatablePage: Use local cache to reduce calls to the WAN cache [extensions/Translate] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1060086 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [12:54:32] (03CR) 10Zabe: [C:03+2] TranslatablePage: Use local cache to reduce calls to the WAN cache [extensions/Translate] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1060085 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [12:54:49] (03CR) 10Btullis: cloudnative-pg: set image tag proven to work (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060097 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [12:55:29] jouncebot: nowandnext [12:55:29] For the next 0 hour(s) and 4 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240806T1200) [12:55:29] In 0 hour(s) and 4 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240806T1300) [12:55:54] abijeet: hit +2 on your patches to get CI runnin [12:55:55] (03CR) 10Brouberol: cloudnative-pg: set image tag proven to work (031 comment) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1060097 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [12:56:48] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: bump image to v20240806085845-f838190 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060076 (https://phabricator.wikimedia.org/T328330) (owner: 10DCausse) [12:57:01] abijeet: but it seems that for the wmf.16 patch there is a non-deterministic failure [12:57:02] zabe, thanks [12:57:09] ah [12:57:16] deterministic not non-deterministic [12:57:39] (03CR) 10Btullis: [C:03+1] "Looks good." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060097 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [12:57:46] (03Merged) 10jenkins-bot: cirrus-streaming-updater: bump image to v20240806085845-f838190 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060076 (https://phabricator.wikimedia.org/T328330) (owner: 10DCausse) [12:57:57] (03CR) 10Btullis: [C:03+1] cloudnative-pg: set image tag proven to work (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060097 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [12:58:20] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [12:58:39] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:00:04] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240806T1300) [13:00:05] MichaelG_WMF, abijeet, hnowlan, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:11] I can deploy [13:00:12] * MichaelG_WMF is here [13:00:22] o/ [13:00:39] o/ [13:01:09] (03PS2) 10Wangombe: Update reference to ElasticSearchTtmServer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1054869 (https://phabricator.wikimedia.org/T335342) [13:01:11] (03CR) 10Zabe: [C:03+2] [Growth] enwiki: Enable frontend for Add Link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059370 (https://phabricator.wikimedia.org/T370802) (owner: 10Urbanecm) [13:01:18] fyi my patch can't be tested on mwdebug (but we've extensively tested it on testwiki) so it can go straight to prod [13:01:25] it hits the jobrunners [13:01:29] (03CR) 10Brouberol: [C:03+2] cloudnative-pg: set image tag proven to work [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060097 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [13:01:36] (03CR) 10Zabe: [C:03+2] group0, frwiki, itwiki: enable shellbox-video [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050378 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [13:01:53] (03Merged) 10jenkins-bot: [Growth] enwiki: Enable frontend for Add Link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059370 (https://phabricator.wikimedia.org/T370802) (owner: 10Urbanecm) [13:01:55] alright [13:02:20] (03Merged) 10jenkins-bot: group0, frwiki, itwiki: enable shellbox-video [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1050378 (https://phabricator.wikimedia.org/T356241) (owner: 10Hnowlan) [13:03:38] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1050378|group0, frwiki, itwiki: enable shellbox-video (T356241)]], [[gerrit:1059370|[Growth] enwiki: Enable frontend for Add Link (T370802)]] [13:03:42] T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241 [13:03:47] T370802: Add a link (Structured task): Release as "turned off" to English Wikipedia - https://phabricator.wikimedia.org/T370802 [13:06:16] (03CR) 10CI reject: [V:04-1] 
TranslatablePage: Use local cache to reduce calls to the WAN cache [extensions/Translate] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1060085 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [13:06:51] (03CR) 10Abijeet Patro: "recheck" [extensions/Translate] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1060085 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [13:07:47] (03CR) 10Tiziano Fogli: "Yes @ltoscano@wikimedia.org, I can confirm that the checklist is completed." [puppet] - 10https://gerrit.wikimedia.org/r/1058565 (owner: 10Tiziano Fogli) [13:08:09] i don't think I'll be able to fix the CI failures before the deployment window ends. nothing in the patch should cause that failure. [13:08:14] !log zabe@deploy1003 hnowlan, urbanecm, zabe: Backport for [[gerrit:1050378|group0, frwiki, itwiki: enable shellbox-video (T356241)]], [[gerrit:1059370|[Growth] enwiki: Enable frontend for Add Link (T370802)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:08:29] * MichaelG_WMF is looking [13:08:30] MichaelG_WMF: can you test? [13:08:35] testing [13:08:37] abijeet: okay [13:09:03] zabe: looks good! [13:09:43] is that patch important for wmf.16? otherwise it can just run through with the train I guess? 
[13:09:47] !log zabe@deploy1003 hnowlan, urbanecm, zabe: Continuing with sync [13:10:00] abijeet: maybe related T371577 [13:10:01] T371577: GlobalBlockListPagerTest::testFormatRow integration test breaking CI - https://phabricator.wikimedia.org/T371577 [13:11:07] might need to backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GlobalBlocking/+/1058729 to wmf.16 as well for your patch to pass ci :( [13:11:18] (03CR) 10Giuseppe Lavagetto: [V:03+1 C:03+2] cache: test requestctl rules in haproxy on cp4044 [puppet] - 10https://gerrit.wikimedia.org/r/1059459 (https://phabricator.wikimedia.org/T370745) (owner: 10Giuseppe Lavagetto) [13:11:26] (03PS1) 10Zabe: Fix test that only works in June or July [extensions/GlobalBlocking] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1060101 (https://phabricator.wikimedia.org/T371577) [13:11:33] (03CR) 10Zabe: [C:03+2] Fix test that only works in June or July [extensions/GlobalBlocking] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1060101 (https://phabricator.wikimedia.org/T371577) (owner: 10Zabe) [13:11:48] dcausse, thanks, that makes sense. 
[13:11:50] just gonna do that, its not gonna hurt [13:12:06] zabe: thanks :) [13:12:48] (03PS2) 10Zabe: TranslatablePage: Use local cache to reduce calls to the WAN cache [extensions/Translate] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1060085 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [13:13:03] (03CR) 10Zabe: [C:03+2] TranslatablePage: Use local cache to reduce calls to the WAN cache [extensions/Translate] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1060085 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [13:13:13] <_joe_> !log depooling cp4044 from traffic to apply new tls termination templates [13:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:22] (03CR) 10Filippo Giunchedi: [C:03+2] admin: promote tappof to root [puppet] - 10https://gerrit.wikimedia.org/r/1058565 (owner: 10Tiziano Fogli) [13:14:20] !log zabe@deploy1003 Finished scap: Backport for [[gerrit:1050378|group0, frwiki, itwiki: enable shellbox-video (T356241)]], [[gerrit:1059370|[Growth] enwiki: Enable frontend for Add Link (T370802)]] (duration: 10m 41s) [13:14:22] (03CR) 10Zabe: [C:03+2] dtpwiki: add timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057193 (https://phabricator.wikimedia.org/T371076) (owner: 10Anzx) [13:14:23] T356241: Move video transcoding to use Shellbox - https://phabricator.wikimedia.org/T356241 [13:14:23] (03CR) 10Zabe: [C:03+2] mywikisource: add portal, author and translation namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057824 (https://phabricator.wikimedia.org/T371060) (owner: 10Anzx) [13:14:23] T370802: Add a link (Structured task): Release as "turned off" to English Wikipedia - https://phabricator.wikimedia.org/T370802 [13:14:42] MichaelG_WMF: hnowlan: done :) [13:15:38] zabe: thank you! [13:15:43] Thank you! 
[13:15:44] (03PS4) 10Anzx: dtpwiki: add timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057193 (https://phabricator.wikimedia.org/T371076) [13:15:48] (03Merged) 10jenkins-bot: mywikisource: add portal, author and translation namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057824 (https://phabricator.wikimedia.org/T371060) (owner: 10Anzx) [13:15:50] (03CR) 10Zabe: [C:03+2] dtpwiki: add timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057193 (https://phabricator.wikimedia.org/T371076) (owner: 10Anzx) [13:16:42] (03Merged) 10jenkins-bot: dtpwiki: add timezone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057193 (https://phabricator.wikimedia.org/T371076) (owner: 10Anzx) [13:17:06] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1057824|mywikisource: add portal, author and translation namespaces (T371060)]], [[gerrit:1057193|dtpwiki: add timezone (T371076)]] [13:17:10] T371060: Configure namespaces for mywikisource - https://phabricator.wikimedia.org/T371060 [13:17:11] T371076: Set timezone for dtpwiki - https://phabricator.wikimedia.org/T371076 [13:19:43] (03PS1) 10Giuseppe Lavagetto: haproxy: remove complex quoting [puppet] - 10https://gerrit.wikimedia.org/r/1060103 [13:20:23] (03CR) 10CDanis: [C:03+1] haproxy: remove complex quoting [puppet] - 10https://gerrit.wikimedia.org/r/1060103 (owner: 10Giuseppe Lavagetto) [13:20:36] !log zabe@deploy1003 anzx, zabe: Backport for [[gerrit:1057824|mywikisource: add portal, author and translation namespaces (T371060)]], [[gerrit:1057193|dtpwiki: add timezone (T371076)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:20:38] zabe: checking [13:20:45] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] haproxy: remove complex quoting [puppet] - 10https://gerrit.wikimedia.org/r/1060103 (owner: 10Giuseppe Lavagetto) [13:21:22] (03Merged) 10jenkins-bot: TranslatablePage: Use local cache to reduce calls to the WAN cache 
[extensions/Translate] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1060086 (https://phabricator.wikimedia.org/T366455) (owner: 10Abijeet Patro) [13:21:48] (03Merged) 10jenkins-bot: Fix test that only works in June or July [extensions/GlobalBlocking] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1060101 (https://phabricator.wikimedia.org/T371577) (owner: 10Zabe) [13:22:06] (03CR) 10Southparkfan: [C:03+1] Update the mediawiki-installation dsh group with new beta snapshot host [puppet] - 10https://gerrit.wikimedia.org/r/1059893 (https://phabricator.wikimedia.org/T370465) (owner: 10Btullis) [13:22:46] FIRING: [36x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:23:10] FIRING: [3x] ConfdResourceFailed: confd resource _etc_haproxy_ipblocks.d_abuse.map.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [13:23:55] zabe: looks good [13:24:10] !log zabe@deploy1003 anzx, zabe: Continuing with sync [13:24:13] cool, syncing [13:27:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:27:58] (03PS1) 10Hnowlan: shellbox-video, admin_ng: bump resource limits and replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060104 (https://phabricator.wikimedia.org/T356241) [13:28:34] !log zabe@deploy1003 Finished scap: Backport for [[gerrit:1057824|mywikisource: add portal, author and translation namespaces (T371060)]], [[gerrit:1057193|dtpwiki: add timezone (T371076)]] (duration: 11m 
28s) [13:28:38] T371060: Configure namespaces for mywikisource - https://phabricator.wikimedia.org/T371060 [13:28:39] T371076: Set timezone for dtpwiki - https://phabricator.wikimedia.org/T371076 [13:29:13] (03PS1) 10Giuseppe Lavagetto: haproxy: move check command to a separate script [puppet] - 10https://gerrit.wikimedia.org/r/1060105 [13:29:23] zabe: we need to revert https://gerrit.wikimedia.org/r/1057824 , there i had used mswikisource instead of mywikisource [13:30:40] could you also write a fixing patch? [13:31:10] Zabe: I can write a fix [13:31:48] (03CR) 10CI reject: [V:04-1] haproxy: move check command to a separate script [puppet] - 10https://gerrit.wikimedia.org/r/1060105 (owner: 10Giuseppe Lavagetto) [13:32:26] zabe, i think we might be able to just make it within the window. [13:32:31] (03CR) 10CDanis: [C:03+1] "lgtm one nit" [puppet] - 10https://gerrit.wikimedia.org/r/1060105 (owner: 10Giuseppe Lavagetto) [13:34:05] (03PS2) 10Giuseppe Lavagetto: haproxy: move check command to a separate script [puppet] - 10https://gerrit.wikimedia.org/r/1060105 [13:35:31] (03PS3) 10Giuseppe Lavagetto: haproxy: move check command to a separate script [puppet] - 10https://gerrit.wikimedia.org/r/1060105 [13:35:43] (03CR) 10Giuseppe Lavagetto: haproxy: move check command to a separate script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1060105 (owner: 10Giuseppe Lavagetto) [13:37:07] (03PS4) 10Anzx: mywikisource: fix namespace dbname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060106 (https://phabricator.wikimedia.org/T371060) [13:38:06] (03CR) 10Giuseppe Lavagetto: [C:03+2] haproxy: move check command to a separate script [puppet] - 10https://gerrit.wikimedia.org/r/1060105 (owner: 10Giuseppe Lavagetto) [13:39:01] (03Merged) 10jenkins-bot: TranslatablePage: Use local cache to reduce calls to the WAN cache [extensions/Translate] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1060085 (https://phabricator.wikimedia.org/T366455) (owner: 
10Abijeet Patro) [13:39:16] zabe: fixed in https://gerrit.wikimedia.org/r/1060106 [13:41:35] (03CR) 10Zabe: [C:03+2] mywikisource: fix namespace dbname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060106 (https://phabricator.wikimedia.org/T371060) (owner: 10Anzx) [13:41:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host vrts2002.codfw.wmnet with OS bookworm [13:41:46] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install vrts2002 - https://phabricator.wikimedia.org/T369672#10045035 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host vrts2002.codfw.wmnet with OS bookworm [13:42:24] (03Merged) 10jenkins-bot: mywikisource: fix namespace dbname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060106 (https://phabricator.wikimedia.org/T371060) (owner: 10Anzx) [13:43:33] (03PS1) 10Giuseppe Lavagetto: haproxy: fix etcd key path [puppet] - 10https://gerrit.wikimedia.org/r/1060109 [13:43:41] !log zabe@deploy1003 Started scap sync-world: T371060 [13:43:43] T371060: Configure namespaces for mywikisource - https://phabricator.wikimedia.org/T371060 [13:44:01] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] haproxy: fix etcd key path [puppet] - 10https://gerrit.wikimedia.org/r/1060109 (owner: 10Giuseppe Lavagetto) [13:48:10] FIRING: [3x] ConfdResourceFailed: confd resource _etc_haproxy_ipblocks.d_abuse.map.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [13:48:25] FIRING: [3x] ConfdResourceFailed: confd resource _etc_haproxy_ipblocks.d_abuse.map.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [13:49:22] this is expected given the other work [13:51:39] !log zabe@deploy1003 Finished scap: T371060 
(duration: 07m 57s) [13:51:42] T371060: Configure namespaces for mywikisource - https://phabricator.wikimedia.org/T371060 [13:52:20] zabe: can you run namespacedupes.php for both mswikisource and mywikisource [13:53:10] FIRING: [3x] ConfdResourceFailed: confd resource _etc_haproxy_ipblocks.d_abuse.map.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [13:53:10] !log cdobbins@cumin1002:~$ sudo cumin 'A:cp' 'disable-puppet "merging CR #1059123"' [13:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:21] yes [13:53:27] did a dry run for both [13:53:35] and its 0 to fix in both cases [13:54:01] thanks [13:54:06] !log upgrading A:wikidough to pdns-rec 4.8.8 [13:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:51] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1060085|TranslatablePage: Use local cache to reduce calls to the WAN cache (T366455)]], [[gerrit:1060101|Fix test that only works in June or July (T371577)]], [[gerrit:1060086|TranslatablePage: Use local cache to reduce calls to the WAN cache (T366455)]] [13:54:53] T371577: GlobalBlockListPagerTest::testFormatRow integration test breaking CI - https://phabricator.wikimedia.org/T371577 [13:55:17] yw [13:55:31] abijeet: we can finally do yours:) [13:55:32] (03PS43) 10CDobbins: varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) [13:55:58] zabe, thanks! 
[13:56:01] (03CR) 10CDobbins: [C:03+2] varnish: fix bug causing %error_body_content% to appear in response body [puppet] - 10https://gerrit.wikimedia.org/r/1059123 (https://phabricator.wikimedia.org/T371424) (owner: 10CDobbins) [13:56:52] !log zabe@deploy1003 abi, zabe: Backport for [[gerrit:1060085|TranslatablePage: Use local cache to reduce calls to the WAN cache (T366455)]], [[gerrit:1060101|Fix test that only works in June or July (T371577)]], [[gerrit:1060086|TranslatablePage: Use local cache to reduce calls to the WAN cache (T366455)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:57:12] abijeet: is the patch testable? [13:57:48] zabe, yea, I'll do some general testing ... marking a page for translation, translate etc. [13:58:01] should take me a few minutes max [13:58:01] alright [13:58:27] I am here [13:58:57] effie, o/ [13:59:02] :) [14:00:05] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [14:00:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on vrts2002.codfw.wmnet with reason: host reimage [14:00:56] (03PS1) 10Giuseppe Lavagetto: haproxy: fix ipblock maps [puppet] - 10https://gerrit.wikimedia.org/r/1060111 [14:01:45] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1020.eqiad.wmnet,service=s5 [14:01:51] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1020.eqiad.wmnet,service=s8 [14:01:52] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] haproxy: fix ipblock maps [puppet] - 10https://gerrit.wikimedia.org/r/1060111 (owner: 10Giuseppe Lavagetto) [14:02:13] !log jclark@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:03:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on vrts2002.codfw.wmnet with reason: host reimage [14:03:33] zabe, looks good [14:03:35] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [14:03:48] cool, syncing [14:03:50] !log zabe@deploy1003 abi, zabe: 
Continuing with sync [14:05:09] (03CR) 10Dzahn: [C:03+2] etherpad: make defaultPadText more explicit about personal use [puppet] - 10https://gerrit.wikimedia.org/r/1060082 (https://phabricator.wikimedia.org/T371591) (owner: 10Jelto) [14:07:05] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wikikube-worker12 - jclark@cumin1002" [14:07:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wikikube-worker12 - jclark@cumin1002" [14:07:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:08:13] !log zabe@deploy1003 Finished scap: Backport for [[gerrit:1060085|TranslatablePage: Use local cache to reduce calls to the WAN cache (T366455)]], [[gerrit:1060101|Fix test that only works in June or July (T371577)]], [[gerrit:1060086|TranslatablePage: Use local cache to reduce calls to the WAN cache (T366455)]] (duration: 13m 22s) [14:08:16] T371577: GlobalBlockListPagerTest::testFormatRow integration test breaking CI - https://phabricator.wikimedia.org/T371577 [14:10:58] abijeet: should be live [14:11:30] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1276.mgmt.eqiad.wmnet with reboot policy FORCED [14:11:31] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1279.mgmt.eqiad.wmnet with reboot policy FORCED [14:11:33] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:11:33] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1281.mgmt.eqiad.wmnet with reboot policy FORCED [14:11:36] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1282.mgmt.eqiad.wmnet with reboot policy FORCED [14:11:39] zabe, thanks so much! 
[14:11:40] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1285.mgmt.eqiad.wmnet with reboot policy FORCED [14:11:45] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1284.mgmt.eqiad.wmnet with reboot policy FORCED [14:11:52] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1277.mgmt.eqiad.wmnet with reboot policy FORCED [14:11:54] yw [14:11:58] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1280.mgmt.eqiad.wmnet with reboot policy FORCED [14:12:16] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1283.mgmt.eqiad.wmnet with reboot policy FORCED [14:12:26] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1278.mgmt.eqiad.wmnet with reboot policy FORCED [14:12:28] (03CR) 10Stevemunene: [C:03+1] Update the beta cluster scap targets for dumps [dumps/scap] - 10https://gerrit.wikimedia.org/r/1059891 (https://phabricator.wikimedia.org/T370465) (owner: 10Btullis) [14:12:48] (03PS1) 10Zabe: Initial configuration for bdrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060112 (https://phabricator.wikimedia.org/T371757) [14:12:48] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1285.mgmt.eqiad.wmnet with reboot policy FORCED [14:13:10] RESOLVED: [2x] ConfdResourceFailed: confd resource _etc_haproxy_ipblocks.d_abuse.map.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:13:11] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1285.mgmt.eqiad.wmnet with reboot policy FORCED [14:13:49] (03PS1) 10Giuseppe Lavagetto: haproxy: add cache_cluster parameter, needed by requestctl [puppet] - 10https://gerrit.wikimedia.org/r/1060113 [14:14:13] (03CR) 
10CDanis: [C:03+1] haproxy: add cache_cluster parameter, needed by requestctl [puppet] - 10https://gerrit.wikimedia.org/r/1060113 (owner: 10Giuseppe Lavagetto) [14:14:33] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ml-serve2009 to codfw - jhancock@cumin2002" [14:14:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ml-serve2009 to codfw - jhancock@cumin2002" [14:14:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:15:07] (03CR) 10Zabe: [C:03+2] Initial configuration for bdrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060112 (https://phabricator.wikimedia.org/T371757) (owner: 10Zabe) [14:15:42] (03Merged) 10jenkins-bot: Initial configuration for bdrwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060112 (https://phabricator.wikimedia.org/T371757) (owner: 10Zabe) [14:17:08] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1272.mgmt.eqiad.wmnet with reboot policy FORCED [14:17:09] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1273.mgmt.eqiad.wmnet with reboot policy FORCED [14:17:09] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1274.mgmt.eqiad.wmnet with reboot policy FORCED [14:17:11] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1275.mgmt.eqiad.wmnet with reboot policy FORCED [14:17:38] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1270.mgmt.eqiad.wmnet with reboot policy FORCED [14:17:46] !log Create Wikipedia West Coast Bajau # T371757 [14:17:48] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1272.mgmt.eqiad.wmnet with reboot policy FORCED [14:17:48] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:49] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1273.mgmt.eqiad.wmnet with reboot policy FORCED [14:17:49] T371757: Create Wikipedia West Coast Bajau - https://phabricator.wikimedia.org/T371757 [14:17:51] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1274.mgmt.eqiad.wmnet with reboot policy FORCED [14:17:51] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1270.mgmt.eqiad.wmnet with reboot policy FORCED [14:18:07] !log zabe@deploy1003 Started scap sync-world: Creating bdrwiki (T371757) [14:18:36] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1270.mgmt.eqiad.wmnet with reboot policy FORCED [14:18:44] (03CR) 10Giuseppe Lavagetto: [C:03+2] haproxy: add cache_cluster parameter, needed by requestctl [puppet] - 10https://gerrit.wikimedia.org/r/1060113 (owner: 10Giuseppe Lavagetto) [14:19:36] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1274.mgmt.eqiad.wmnet with reboot policy FORCED [14:20:20] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1273.mgmt.eqiad.wmnet with reboot policy FORCED [14:20:34] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1272.mgmt.eqiad.wmnet with reboot policy FORCED [14:21:01] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1273.mgmt.eqiad.wmnet with reboot policy FORCED [14:21:33] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1273.mgmt.eqiad.wmnet with reboot policy FORCED [14:21:37] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:23:14] 
!log upgrade debmonitor-server on debmonitor[1,2]003 to version 0.5 - cp /var/cache/apt/archives/python3-debmonitor_0.4.0-3_all.deb . [14:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:18] ufff [14:23:19] amending.. [14:23:29] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [14:23:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host vrts2002.codfw.wmnet with OS bookworm [14:23:35] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install vrts2002 - https://phabricator.wikimedia.org/T369672#10045232 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host vrts2002.codfw.wmnet with OS bookworm completed: - vrts2002 (**PASS*... [14:24:29] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install vrts2002 - https://phabricator.wikimedia.org/T369672#10045234 (10Jhancock.wm) [14:24:50] !log zabe@deploy1003 Finished scap: Creating bdrwiki (T371757) (duration: 06m 43s) [14:24:53] T371757: Create Wikipedia West Coast Bajau - https://phabricator.wikimedia.org/T371757 [14:25:30] !log zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=bdrwiki --cluster=all 2>&1 | tee /tmp/bdrwiki.UpdateSearchIndexConfig.log # T371757 [14:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:49] (03PS19) 10Effie Mouzeli: cronjobs : update modules to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) [14:26:23] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops: Q1:rack/setup/install vrts2002 - https://phabricator.wikimedia.org/T369672#10045237 (10Jhancock.wm) 05Open→03Resolved @Arnoldokoth this is complete. all yours! 
[14:26:46] (03CR) 10CI reject: [V:04-1] cronjobs : update modules to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli) [14:27:09] (03Abandoned) 10Elukey: Move debmonitor discovery record to debmonitor2003 [dns] - 10https://gerrit.wikimedia.org/r/1060094 (https://phabricator.wikimedia.org/T368744) (owner: 10Elukey) [14:27:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:28:46] (03PS3) 10Andrew Bogott: dynamic proxy: Add an endpoint for scrubbing out nonexistent backends [puppet] - 10https://gerrit.wikimedia.org/r/1059958 (https://phabricator.wikimedia.org/T371707) [14:28:46] (03PS3) 10Andrew Bogott: wmf_sink: replace targetted proxy cleanup with project-wide cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1059959 (https://phabricator.wikimedia.org/T371707) [14:28:53] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1020.eqiad.wmnet with reason: Reimaging clouddb1020 T365424 [14:28:53] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060115 [14:28:53] (03CR) 10Zabe: [C:03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060115 (owner: 10Zabe) [14:28:56] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424 [14:29:06] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1020.eqiad.wmnet with reason: Reimaging clouddb1020 T365424 [14:29:19] (03CR) 10CI reject: [V:04-1] dynamic proxy: Add an endpoint for scrubbing out nonexistent backends [puppet] - 10https://gerrit.wikimedia.org/r/1059958 (https://phabricator.wikimedia.org/T371707) (owner: 10Andrew Bogott) 
[14:29:23] (03PS1) 10Giuseppe Lavagetto: haproxy: various fixes [puppet] - 10https://gerrit.wikimedia.org/r/1060116 [14:29:33] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060115 (owner: 10Zabe) [14:29:47] !log cdobbins@cumin1002:~$ sudo cumin 'A:cp' 'run-puppet-agent --enable "merging CR #1059123"' [14:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:06] !log zabe@deploy1003 Started scap sync-world: update interwiki cache [14:30:10] FIRING: ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:30:27] (03PS1) 10Andrew Bogott: nova-proxy api (invisible-unicorn.py): run through black [puppet] - 10https://gerrit.wikimedia.org/r/1060118 [14:31:03] (03CR) 10CI reject: [V:04-1] nova-proxy api (invisible-unicorn.py): run through black [puppet] - 10https://gerrit.wikimedia.org/r/1060118 (owner: 10Andrew Bogott) [14:32:57] (03CR) 10Giuseppe Lavagetto: [C:03+2] haproxy: various fixes [puppet] - 10https://gerrit.wikimedia.org/r/1060116 (owner: 10Giuseppe Lavagetto) [14:33:23] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, please note that new metrics will be issued due to "job" label name change" [puppet] - 10https://gerrit.wikimedia.org/r/1060064 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [14:33:27] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp3081*} and A:cp for 9.2.5-1wm2 [14:33:32] (03CR) 10CDanis: [C:03+1] haproxy: various fixes [puppet] - 10https://gerrit.wikimedia.org/r/1060116 (owner: 10Giuseppe Lavagetto) [14:34:21] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1276.mgmt.eqiad.wmnet with reboot policy FORCED [14:34:24] !log 
jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1277.mgmt.eqiad.wmnet with reboot policy FORCED [14:34:30] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1279.mgmt.eqiad.wmnet with reboot policy FORCED [14:34:34] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1281.mgmt.eqiad.wmnet with reboot policy FORCED [14:34:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1283.mgmt.eqiad.wmnet with reboot policy FORCED [14:34:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1284.mgmt.eqiad.wmnet with reboot policy FORCED [14:34:54] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1280.mgmt.eqiad.wmnet with reboot policy FORCED [14:35:01] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1282.mgmt.eqiad.wmnet with reboot policy FORCED [14:35:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1278.mgmt.eqiad.wmnet with reboot policy FORCED [14:35:23] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1285.mgmt.eqiad.wmnet with reboot policy FORCED [14:36:33] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on P{cp3081*} and A:cp for 9.2.5-1wm2 [14:37:17] !log zabe@deploy1003 Finished scap: update interwiki cache (duration: 07m 10s) [14:37:28] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1287.mgmt.eqiad.wmnet with reboot policy FORCED [14:37:41] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1291.mgmt.eqiad.wmnet with reboot policy FORCED [14:38:21] 
!log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1286.mgmt.eqiad.wmnet with reboot policy FORCED [14:38:23] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1288.mgmt.eqiad.wmnet with reboot policy FORCED [14:38:48] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10045289 (10Jhancock.wm) a:03Jhancock.wm [14:38:50] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Install (2) 960GB SSDs each in kafka-main20[06-10] - https://phabricator.wikimedia.org/T371423#10045294 (10Jhancock.wm) we can schedule this any time on Wednesday or Thursday this week (august 7th or 8th) or some time next week. Drives are set aside and ready to... [14:38:55] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1289.mgmt.eqiad.wmnet with reboot policy FORCED [14:39:18] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1290.mgmt.eqiad.wmnet with reboot policy FORCED [14:39:20] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1292.mgmt.eqiad.wmnet with reboot policy FORCED [14:39:23] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:25] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1293.mgmt.eqiad.wmnet with reboot policy FORCED [14:39:50] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1294.mgmt.eqiad.wmnet with reboot policy FORCED [14:39:53] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1295.mgmt.eqiad.wmnet with reboot policy FORCED [14:40:10] RESOLVED: ConfdResourceFailed: confd resource 
_etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:40:33] (03PS1) 10Dzahn: gerrit: switch nft throttling policy to drop [puppet] - 10https://gerrit.wikimedia.org/r/1060121 (https://phabricator.wikimedia.org/T365259) [14:41:58] <_joe_> !log repool cp4044 [14:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:21] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1275.mgmt.eqiad.wmnet with reboot policy FORCED [14:43:27] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1274.mgmt.eqiad.wmnet with reboot policy FORCED [14:43:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1272.mgmt.eqiad.wmnet with reboot policy FORCED [14:44:00] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1270.mgmt.eqiad.wmnet with reboot policy FORCED [14:44:21] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1273.mgmt.eqiad.wmnet with reboot policy FORCED [14:44:42] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1296.mgmt.eqiad.wmnet with reboot policy FORCED [14:53:02] !log sukhe@cumin1002 conftool action : set/pooled=no; selector: name=dns7002.wikimedia.org,service=recdns [reason: anycast-healthchecker 0.9.8 upgrade] [14:53:28] !log sukhe@cumin1002 conftool action : set/pooled=no; selector: name=dns7002.wikimedia.org [reason: anycast-healthchecker 0.9.8 upgrade] [14:55:56] !log sukhe@cumin1002 conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: [done] anycast-healthchecker 0.9.8 upgrade] [14:56:40] !log disable puppet on A:dnsbox for cluster-wide anycast-hc 0.9.8 
upgrade on remaining hosts: T370068 [14:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:43] T370068: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) - https://phabricator.wikimedia.org/T370068 [14:58:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1287.mgmt.eqiad.wmnet with reboot policy FORCED [14:58:23] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1286.mgmt.eqiad.wmnet with reboot policy FORCED [14:58:32] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1288.mgmt.eqiad.wmnet with reboot policy FORCED [14:58:42] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1289.mgmt.eqiad.wmnet with reboot policy FORCED [14:58:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1290.mgmt.eqiad.wmnet with reboot policy FORCED [14:58:57] (03PS8) 10CDanis: haproxy: exclude some requests from concurrency tracking [puppet] - 10https://gerrit.wikimedia.org/r/1059126 (https://phabricator.wikimedia.org/T368389) [14:58:58] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1292.mgmt.eqiad.wmnet with reboot policy FORCED [14:59:05] (03PS9) 10CDanis: haproxy: exclude some requests from concurrency tracking [puppet] - 10https://gerrit.wikimedia.org/r/1059126 (https://phabricator.wikimedia.org/T368389) [14:59:21] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1291.mgmt.eqiad.wmnet with reboot policy FORCED [14:59:23] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:33] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1293.mgmt.eqiad.wmnet with reboot policy FORCED [14:59:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1294.mgmt.eqiad.wmnet with reboot policy FORCED [14:59:44] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1295.mgmt.eqiad.wmnet with reboot policy FORCED [15:00:04] eoghan, jelto, arnoldokoth, and mutante: SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240806T1500). Please do the needful. [15:01:53] (03CR) 10CDanis: [C:03+2] haproxy: exclude some requests from concurrency tracking [puppet] - 10https://gerrit.wikimedia.org/r/1059126 (https://phabricator.wikimedia.org/T368389) (owner: 10CDanis) [15:01:55] !log disabling puppet on cp nodes to deploy https://gerrit.wikimedia.org/r/1059126 [15:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:29] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1296.mgmt.eqiad.wmnet with reboot policy FORCED [15:03:04] (03CR) 10RLazarus: [C:03+1] "Great catch, thanks for the fix!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1059956 (https://phabricator.wikimedia.org/T341553) (owner: 10Scott French) [15:03:38] (03PS1) 10Ebernhardson: Enable NetworkSession extension for most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060123 (https://phabricator.wikimedia.org/T355267) [15:04:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 06 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060123 (https://phabricator.wikimedia.org/T355267) (owner: 10Ebernhardson) [15:06:42] 10SRE-tools, 06Infrastructure-Foundations: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744#10045418 (10elukey) The issue is described in T371899. I proceeded anyway to upgrade both debmonitor server hosts, all good so far. Next step: upgrade the debm... [15:08:41] (03CR) 10Stevemunene: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1059909 (https://phabricator.wikimedia.org/T364366) (owner: 10Ryan Kemper) [15:10:00] (03CR) 10Vgutierrez: [C:03+1] hieradata: Remove traffic-cache-atstext-buster [puppet] - 10https://gerrit.wikimedia.org/r/1059943 (owner: 10BCornwall) [15:10:44] !log re-enabling puppet on cp nodes to deploy https://gerrit.wikimedia.org/r/1059126 [15:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:52] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host sretest1001.mgmt.eqiad.wmnet with reboot policy GRACEFUL [15:11:39] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1001.mgmt.eqiad.wmnet with reboot policy GRACEFUL [15:12:05] (03CR) 10Ahmon Dancy: deployment_server: mwscript_k8s uses report.json (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1059956 (https://phabricator.wikimedia.org/T341553) (owner: 10Scott French) [15:12:48] !log 
elukey@cumin1002 START - Cookbook sre.hosts.provision for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED [15:13:51] (03PS3) 10Scott French: deployment_server: mwscript_k8s uses report.json [puppet] - 10https://gerrit.wikimedia.org/r/1059956 (https://phabricator.wikimedia.org/T341553) [15:14:08] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: ml-serve2001.codfw.wmnet: continued uncorrectable ECC errors - https://phabricator.wikimedia.org/T371872#10045443 (10Papaul) a:05Papaul→03None [15:14:29] !log sukhe@cumin1002 conftool action : set/pooled=no; selector: name=dns1004.wikimedia.org [reason: anycast-healthchecker 0.9.8 upgrade] [15:16:32] !log sukhe@cumin1002 conftool action : set/pooled=yes; selector: name=dns1004.wikimedia.org [reason: [done] anycast-healthchecker 0.9.8 upgrade] [15:17:07] (03PS4) 10Scott French: deployment_server: mwscript_k8s uses report.json [puppet] - 10https://gerrit.wikimedia.org/r/1059956 (https://phabricator.wikimedia.org/T341553) [15:18:35] (03CR) 10Ahmon Dancy: deployment_server: mwscript_k8s uses report.json (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1059956 (https://phabricator.wikimedia.org/T341553) (owner: 10Scott French) [15:18:37] !log sukhe@cumin1002 conftool action : set/pooled=no; selector: name=dns1005.wikimedia.org [reason: anycast-healthchecker 0.9.8 upgrade] [15:19:23] FIRING: [3x] SystemdUnitFailed: netbox_report_coherence_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:20:04] (03CR) 10Scott French: "Thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1059956 (https://phabricator.wikimedia.org/T341553) (owner: 10Scott French) [15:20:31] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED [15:21:04] !log sukhe@cumin1002 conftool action : set/pooled=yes; selector: name=dns1005.wikimedia.org [reason: [done] anycast-healthchecker 0.9.8 upgrade] [15:23:17] !log sukhe@cumin1002 conftool action : set/pooled=no; selector: name=dns1006.wikimedia.org [reason: anycast-healthchecker 0.9.8 upgrade] [15:23:22] (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1060127 [15:23:32] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker2035.mgmt.codfw.wmnet with reboot policy GRACEFUL [15:23:36] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1023.eqiad.wmnet with OS bullseye [15:23:39] (03PS2) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1060127 [15:23:44] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1060127 (owner: 10CDanis) [15:25:48] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2035.mgmt.codfw.wmnet with reboot policy GRACEFUL [15:25:50] !log sukhe@cumin1002 conftool action : set/pooled=yes; selector: name=dns1006.wikimedia.org [reason: [done] anycast-healthchecker 0.9.8 upgrade] [15:25:55] (03PS3) 10CDanis: haproxy: fix action order [puppet] - 10https://gerrit.wikimedia.org/r/1060127 [15:26:02] (03CR) 10CDanis: "https://puppet-compiler.wmflabs.org/output/1060127/1615/cp1100.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1060127 (owner: 10CDanis) [15:26:25] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [15:26:37] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:27:37] (03CR) 10Giuseppe Lavagetto: 
[C:03+1] haproxy: fix action order [puppet] - 10https://gerrit.wikimedia.org/r/1060127 (owner: 10CDanis) [15:27:48] (03CR) 10CDanis: [C:03+2] haproxy: fix action order [puppet] - 10https://gerrit.wikimedia.org/r/1060127 (owner: 10CDanis) [15:30:48] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:30:57] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:33:44] (03PS1) 10Jelto: gitlab: enable nft throttling on role level, but just log [puppet] - 10https://gerrit.wikimedia.org/r/1060131 (https://phabricator.wikimedia.org/T366882) [15:35:28] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:36:03] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1060131 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [15:36:51] (03CR) 10Jelto: [V:03+1] "Similar to gerrit, start with just logging and not dropping" [puppet] - 10https://gerrit.wikimedia.org/r/1060131 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [15:37:30] (03CR) 10Vgutierrez: [C:03+1] DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1055230 (owner: 10Ncmonitor) [15:39:29] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ml-serve2010 to codfw - jhancock@cumin2002" [15:39:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ml-serve2010 to codfw - jhancock@cumin2002" [15:39:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:41:28] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:43:23] (03PS4) 10Ahmon Dancy: Add new image building 
command for mwbuilder sudo [puppet] - 10https://gerrit.wikimedia.org/r/1059942 (https://phabricator.wikimedia.org/T371904) [15:44:49] (03PS2) 10Andrew Bogott: nova-proxy api (invisible-unicorn.py): run through black [puppet] - 10https://gerrit.wikimedia.org/r/1060118 [15:46:05] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10045661 (10cmooney) 05Open→03Resolved Thanks guys, the second ports are now configured on the switches. [15:46:12] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ml-serve2011 to codfw - jhancock@cumin2002" [15:46:13] (03CR) 10Andrew Bogott: [C:03+2] nova-proxy api (invisible-unicorn.py): run through black [puppet] - 10https://gerrit.wikimedia.org/r/1060118 (owner: 10Andrew Bogott) [15:46:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ml-serve2011 to codfw - jhancock@cumin2002" [15:46:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:50:25] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install cloudcephosd10[35-38] - https://phabricator.wikimedia.org/T363344#10045673 (10cmooney) I should say cloudcephosd1036 change I've not pushed to the switch - that will happen when we do a homer run after the planned reb... 
[15:54:53] (03PS2) 10Michael Große: fix(i18n): adjust broken mentorship eligibility copy [extensions/GrowthExperiments] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1060136 (https://phabricator.wikimedia.org/T371775) [15:57:09] 06SRE, 06Infrastructure-Foundations, 10netops: cloudsw1-d5-eqiad instability Aug 6 2024 - https://phabricator.wikimedia.org/T371879#10045702 (10cmooney) Just to update on the situation things remain stable since the changes earlier on. ` cmooney@cloudsw1-d5-eqiad> show bgp summary | match "^[0-9]" 10.64.... [15:58:06] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q#:rack/setup/install payments200[456] - https://phabricator.wikimedia.org/T369942#10045705 (10Jhancock.wm) [15:58:39] (03PS2) 10Michael Große: fix(i18n): adjust broken mentorship eligibility copy [extensions/GrowthExperiments] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1060139 (https://phabricator.wikimedia.org/T371775) [15:58:57] (03PS4) 10Andrew Bogott: dynamic proxy: Add an endpoint for scrubbing out nonexistent backends [puppet] - 10https://gerrit.wikimedia.org/r/1059958 (https://phabricator.wikimedia.org/T371707) [15:58:58] (03PS4) 10Andrew Bogott: wmf_sink: replace targetted proxy cleanup with project-wide cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1059959 (https://phabricator.wikimedia.org/T371707) [16:00:04] jhathaway and rzl: Time to snap out of that daydream and deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240806T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. 
[16:03:04] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1020.eqiad.wmnet with reason: Reimaging clouddb1020 T365424 [16:03:14] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424 [16:03:18] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1020.eqiad.wmnet with reason: Reimaging clouddb1020 T365424 [16:08:07] !log sudo cumin "A:dnsbox" "run-puppet-agent --enable 'upgrading anycast-hc'": finish anycast-hc upgrade [16:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:27] !log sudo cumin "A:dnsbox" "run-puppet-agent --enable 'upgrading anycast-hc'": finish anycast-hc upgrade: T370068 [16:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:30] T370068: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) - https://phabricator.wikimedia.org/T370068 [16:08:51] !log fnegri@cumin1002 START - Cookbook sre.hosts.reimage for host clouddb1020.eqiad.wmnet with OS bookworm [16:21:06] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb1020.eqiad.wmnet with reason: host reimage [16:21:35] (03PS3) 10Ncmonitor: DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1055230 [16:21:43] (03CR) 10BCornwall: [C:03+2] DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1055230 (owner: 10Ncmonitor) [16:21:46] (03CR) 10BCornwall: [V:03+2 C:03+2] DNSRepository: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1055230 (owner: 10Ncmonitor) [16:23:58] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb1020.eqiad.wmnet with reason: host reimage [16:27:28] (03CR) 10Herron: [C:03+1] benthos: use fully qualified kafka cluster name [puppet] - 10https://gerrit.wikimedia.org/r/1060070 (owner: 10Filippo Giunchedi) [16:27:40] 
(03CR) 10BCornwall: [C:03+2] hieradata: Remove traffic-cache-atstext-buster [puppet] - 10https://gerrit.wikimedia.org/r/1059943 (owner: 10BCornwall) [16:28:07] (03CR) 10Herron: [C:03+1] webperf: use fully qualified kafka cluster names [puppet] - 10https://gerrit.wikimedia.org/r/1060069 (owner: 10Filippo Giunchedi) [16:28:38] (03CR) 10Herron: [C:03+1] pontoon: restore Benthos instances functionality [puppet] - 10https://gerrit.wikimedia.org/r/1060071 (owner: 10Filippo Giunchedi) [16:35:24] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:37:33] 06SRE-OnFire, 10Incident Tooling: corto: production deployment - https://phabricator.wikimedia.org/T370789#10045977 (10BCornwall) p:05Triage→03Medium a:03BCornwall [16:39:23] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding payments200 to codfw - jhancock@cumin2002" [16:39:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding payments200 to codfw - jhancock@cumin2002" [16:39:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:41:52] (03PS20) 10Effie Mouzeli: cronjobs : update modules to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) [16:42:40] (03CR) 10CI reject: [V:04-1] cronjobs : update modules to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli) [16:44:25] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1023.eqiad.wmnet with OS bullseye [16:44:42] (03PS21) 10Effie Mouzeli: cronjobs : update modules to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) [16:45:38] (03CR) 10CI reject: [V:04-1] cronjobs : 
update modules to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli) [16:48:38] (03PS22) 10Effie Mouzeli: cronjobs : update modules to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) [16:49:41] (03CR) 10CI reject: [V:04-1] cronjobs : update modules to job 2.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049573 (https://phabricator.wikimedia.org/T356885) (owner: 10Effie Mouzeli) [16:51:52] (03CR) 10Ssingh: [C:03+1] ncmonitor: Enable patches, email; Set monthly [puppet] - 10https://gerrit.wikimedia.org/r/1056567 (owner: 10BCornwall) [16:52:43] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q#:rack/setup/install payments200[456] - https://phabricator.wikimedia.org/T369942#10046046 (10Jhancock.wm) a:05Jhancock.wm→03Papaul @papaul ready for your part. NAME: payments2004 ETH1 <> FASW-C8A eth-0/0/8 ETH2 <> FASW-C8B eth-1/0/8 NAME: paymen... [16:55:09] (03CR) 10BCornwall: [C:03+2] ncmonitor: Enable patches, email; Set monthly [puppet] - 10https://gerrit.wikimedia.org/r/1056567 (owner: 10BCornwall) [16:56:44] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddb1020.eqiad.wmnet with OS bookworm [16:58:20] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) - https://phabricator.wikimedia.org/T370068#10046062 (10ssingh) 05Open→03Resolved We have upgraded all DNS boxes, Wikimedia DNS and durum hosts to the latest version of anycast-he... 
[17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240806T1700) [17:01:05] (03PS1) 10David Caro: ceph: add new cloudcephosd1035 [puppet] - 10https://gerrit.wikimedia.org/r/1060146 [17:01:25] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=s8 [17:01:29] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=s5 [17:04:34] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1060049 (https://phabricator.wikimedia.org/T364366) (owner: 10DCausse) [17:11:01] (03PS2) 10David Caro: ceph: add new cloudcephosd1035 [puppet] - 10https://gerrit.wikimedia.org/r/1060146 (https://phabricator.wikimedia.org/T363344) [17:14:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 06 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1060139 (https://phabricator.wikimedia.org/T371775) (owner: 10Michael Große) [17:15:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 06 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1060136 (https://phabricator.wikimedia.org/T371775) (owner: 10Michael Große) [17:19:06] (03PS1) 10Urbanecm: [Growth] dewiki: Enable frontend for Add Link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060148 (https://phabricator.wikimedia.org/T371597) [17:22:46] FIRING: [36x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:24:28] 
ryankemper: ^ known? [17:24:37] (03PS1) 10DCausse: search: add search update lag SLO [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1060150 (https://phabricator.wikimedia.org/T328330) [17:25:20] (03CR) 10Scott French: "Thanks, Ahmon! One question." [puppet] - 10https://gerrit.wikimedia.org/r/1059942 (https://phabricator.wikimedia.org/T371904) (owner: 10Ahmon Dancy) [17:26:52] (03CR) 10Dzahn: [V:03+1] ci: add new ECDSA ssh key for jenkins to connect to itself (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1059106 (https://phabricator.wikimedia.org/T177826) (owner: 10Dzahn) [17:28:25] (03PS1) 10Ssingh: depool eqsin: emergency patch (do not merge unless required) [dns] - 10https://gerrit.wikimedia.org/r/1060151 [17:29:26] (03CR) 10Ahmon Dancy: Add new image building command for mwbuilder sudo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1059942 (https://phabricator.wikimedia.org/T371904) (owner: 10Ahmon Dancy) [17:29:57] (03PS1) 10Stoyofuku-wmf: Promote dark mode for anons on various wikis - take 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060152 (https://phabricator.wikimedia.org/T371070) [17:35:58] jouncebot nowandnext [17:35:58] For the next 0 hour(s) and 24 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240806T1700) [17:35:58] In 0 hour(s) and 24 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240806T1800) [17:39:18] (03CR) 10Dzahn: [V:03+1 C:03+2] ci: add new ECDSA ssh key for jenkins to connect to itself [puppet] - 10https://gerrit.wikimedia.org/r/1059106 (https://phabricator.wikimedia.org/T177826) (owner: 10Dzahn) [17:40:27] !log CI - adding a new SSH key to jenkins - in the same file without removing the old key yet - this is expected to have no effect, but if CI breaks will revert - T177826 [17:40:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:30] T177826: Upgrade CI Jenkins ssh key to ecdsa - https://phabricator.wikimedia.org/T177826 [17:40:52] (03CR) 10Scott French: [C:03+2] deployment_server: mwscript_k8s uses report.json [puppet] - 10https://gerrit.wikimedia.org/r/1059956 (https://phabricator.wikimedia.org/T341553) (owner: 10Scott French) [17:41:09] eh, jouncebot decided to say that but then quit? [17:42:08] sukhe: known, I’m working on fixing monitoring, the underlying service is healthy [17:42:18] thanks ryankemper [17:47:48] !log stop pybal on lvs5004 for server reboot [17:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:24] does something that SHOULD be ok but has a small risk that it breaks CI [17:48:31] (03CR) 10Ryan Kemper: [C:03+2] wdqs: set proper kafka topics for main and scholarly [puppet] - 10https://gerrit.wikimedia.org/r/1060049 (https://phabricator.wikimedia.org/T364366) (owner: 10DCausse) [17:48:33] will confirm in a second that jenkins is still working [17:49:12] conflicts with Ryan [17:49:49] mutante: I merged both, hope that's ok [17:50:01] ryankemper: yes, it is. thanks! [17:50:05] (03CR) 10Scott French: [C:03+1] Add new image building command for mwbuilder sudo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1059942 (https://phabricator.wikimedia.org/T371904) (owner: 10Ahmon Dancy) [17:50:07] checking my part [17:50:30] and .. it fails :) [17:50:37] worked in compiler.. sigh [17:50:59] !log sukhe@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs5004.eqsin.wmnet [17:52:06] ah.. 
it works on the next puppet run ..interesting [17:52:37] (03CR) 10Ryan Kemper: [C:03+2] wdqs: add graph split type to blackbox probe alert [puppet] - 10https://gerrit.wikimedia.org/r/1059909 (https://phabricator.wikimedia.org/T364366) (owner: 10Ryan Kemper) [17:53:40] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1059418 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [17:53:42] mutante: did it fail or just make no difference on first puppet run? it took ~30s after my irc message for the puppet merge to fully complete, so if the latter that would be why [17:53:50] (03CR) 10Dzahn: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1060151 (owner: 10Ssingh) [17:53:52] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs5004.eqsin.wmnet [17:54:23] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:54:29] ryankemper: puppet run failed on one of 2 machines with a "file not found" kind of thing. then on the next run it worked [17:54:47] should be ok now [17:54:52] mutante: CI seems to be working on the recheck above [17:55:00] sukhe: great, thanks [17:55:12] (finished) [17:55:12] so what I did was I added a new SSH key to jenkins [17:55:18] which jenkins uses to connect to itself and agents [17:55:28] it's a single config file that holds 2 keys now, old and new [17:55:32] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for Seanleong-WMDE - https://phabricator.wikimedia.org/T371694#10046293 (10KFrancis) Hi all, the NDA is out for signatures. I'll confirm when it's complete. [17:55:50] and it was expected that this is ok. 
and it still works, so alright [17:56:34] (03PS5) 10Andrew Bogott: wmf_sink: replace targetted proxy cleanup with project-wide cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1059959 (https://phabricator.wikimedia.org/T371707) [17:56:51] mutante: thanks! [17:57:44] (03PS5) 10Andrew Bogott: dynamic proxy: Add an endpoint for scrubbing out nonexistent backends [puppet] - 10https://gerrit.wikimedia.org/r/1059958 (https://phabricator.wikimedia.org/T371707) [17:57:44] (03PS6) 10Andrew Bogott: wmf_sink: replace targetted proxy cleanup with project-wide cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1059959 (https://phabricator.wikimedia.org/T371707) [18:00:05] jnuche and brennen: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240806T1800). [18:01:28] 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, 07Jenkins, and 2 others: Upgrade CI Jenkins ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#10046298 (10Dzahn) @hashar The new private key has been added to the jenkins credentials store, twice, with the 2 different user... 
[18:02:49] (03PS6) 10Andrew Bogott: dynamic proxy: Add an endpoint for scrubbing out nonexistent backends [puppet] - 10https://gerrit.wikimedia.org/r/1059958 (https://phabricator.wikimedia.org/T371707) [18:02:49] (03PS7) 10Andrew Bogott: wmf_sink: replace targetted proxy cleanup with project-wide cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1059959 (https://phabricator.wikimedia.org/T371707) [18:03:21] (03CR) 10CI reject: [V:04-1] dynamic proxy: Add an endpoint for scrubbing out nonexistent backends [puppet] - 10https://gerrit.wikimedia.org/r/1059958 (https://phabricator.wikimedia.org/T371707) (owner: 10Andrew Bogott) [18:04:50] (03PS7) 10Andrew Bogott: dynamic proxy: Add an endpoint for scrubbing out nonexistent backends [puppet] - 10https://gerrit.wikimedia.org/r/1059958 (https://phabricator.wikimedia.org/T371707) [18:04:50] (03PS8) 10Andrew Bogott: wmf_sink: replace targetted proxy cleanup with project-wide cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1059959 (https://phabricator.wikimedia.org/T371707) [18:06:17] 10ops-codfw, 06DC-Ops, 06Machine-Learning-Team: Q1:rack/setup/install ml-serve20[09-11] - https://phabricator.wikimedia.org/T371920 (10RobH) 03NEW [18:06:40] 10ops-codfw, 06DC-Ops, 06Machine-Learning-Team: Q1:rack/setup/install ml-serve20[09-11] - https://phabricator.wikimedia.org/T371920#10046349 (10RobH) [18:07:01] o/ - nothing to do for this window. [18:09:02] 10ops-codfw, 06DC-Ops, 06Machine-Learning-Team: Q1:rack/setup/install ml-serve20[09-11] - https://phabricator.wikimedia.org/T371920#10046361 (10RobH) [18:09:03] 10ops-codfw, 06DC-Ops, 06Machine-Learning-Team: Q1:rack/setup/install ml-serve20[09-11] - https://phabricator.wikimedia.org/T371920#10046352 (10RobH) a:05Jhancock.wm→03klausman >>! In T366521#10045581, @Jhancock.wm wrote: > these servers are racked. and I'll have them all pingable on the mgmt network in... [18:10:26] I'm going to use the window to test update to the mediawiki container image build code. 
[18:12:12] cool [18:13:22] !log dancy@deploy1003 Started scap sync-world: testing T370934 [18:13:25] T370934: Build and publish multiple MediaWiki production images for a given set of PHP versions - https://phabricator.wikimedia.org/T370934 [18:13:50] oops. wrong bug reference on that announcement. [18:14:00] but slightly related so no big deal. :-) [18:15:56] (03PS8) 10Andrew Bogott: dynamic proxy: Add an endpoint for scrubbing out nonexistent backends [puppet] - 10https://gerrit.wikimedia.org/r/1059958 (https://phabricator.wikimedia.org/T371707) [18:15:56] (03PS9) 10Andrew Bogott: wmf_sink: replace targetted proxy cleanup with project-wide cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1059959 (https://phabricator.wikimedia.org/T371707) [18:18:16] !log stop pybal on lvs5005 for server reboot [18:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:51] 10ops-codfw, 06DC-Ops: Port with no description on access switch - https://phabricator.wikimedia.org/T371923 (10phaultfinder) 03NEW [18:28:19] !log sudo cumin "lvs6001*" 'disable-puppet "rebooting" && systemctl stop pybal.service' [18:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:41] (03PS2) 10Stoyofuku-wmf: Promote dark mode for anons on various wikis - take 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060152 (https://phabricator.wikimedia.org/T371070) [18:32:45] FIRING: Traffic bill over quota: Alert for device cr2-eqsin.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [18:33:12] huh [18:33:16] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs5005.eqsin.wmnet [18:36:28] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs5005.eqsin.wmnet [18:37:16] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q#:rack/setup/install payments200[456] - https://phabricator.wikimedia.org/T369942#10046443 
(10Papaul) ` papaul@fasw-c-codfw# show | compare [edit interfaces interface-range disabled] - member "ge-[0-1]/0/8"; - member "ge-[0-1]/0/9"; [edit interf... [18:37:46] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q#:rack/setup/install payments200[456] - https://phabricator.wikimedia.org/T369942#10046444 (10Papaul) @Jhancock.wm switch configuration done [18:40:44] (03PS9) 10Andrew Bogott: dynamic proxy: Add an endpoint for scrubbing out nonexistent backends [puppet] - 10https://gerrit.wikimedia.org/r/1059958 (https://phabricator.wikimedia.org/T371707) [18:40:44] (03PS10) 10Andrew Bogott: wmf_sink: replace targetted proxy cleanup with project-wide cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1059959 (https://phabricator.wikimedia.org/T371707) [18:41:02] !log start pybal on lvs5005 [18:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:54] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1174 - https://phabricator.wikimedia.org/T371927 (10ops-monitoring-bot) 03NEW [18:43:54] 10ops-codfw, 06SRE, 06DC-Ops: Port with no description on access switch - https://phabricator.wikimedia.org/T371923#10046471 (10phaultfinder) [18:44:27] !log dancy@deploy1003 Finished scap: testing T370934 (duration: 31m 05s) [18:44:29] !log sukhe@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs6001.drmrs.wmnet [18:44:30] T370934: Build and publish multiple MediaWiki production images for a given set of PHP versions - https://phabricator.wikimedia.org/T370934 [18:44:43] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission payments2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371630#10046473 (10Papaul) 05Resolved→03Open a:05Jhancock.wm→03Dwisehaupt @Dwisehaupt this server still has it's DNS entries [18:45:22] !log dancy@deploy1003 Started scap sync-world: testing T371904 [18:45:24] T371904: Rewrite remaining make-container-image code in Python - https://phabricator.wikimedia.org/T371904 
[18:45:30] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission payments2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371631#10046477 (10Papaul) 05Resolved→03Open a:05Jhancock.wm→03Dwisehaupt @Dwisehaupt this server still has it's DNS entries [18:47:17] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs6001.drmrs.wmnet [18:47:31] FIRING: [36x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:48:06] !log re-enable pybal on lvs6001 [18:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:37] !log dancy@deploy1003 Finished scap: testing T371904 (duration: 04m 14s) [18:49:52] (03Abandoned) 10Ssingh: depool eqsin: emergency patch (do not merge unless required) [dns] - 10https://gerrit.wikimedia.org/r/1060151 (owner: 10Ssingh) [18:50:53] I'm done w/ scap testing. 
[18:52:31] RESOLVED: [36x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:52:45] RESOLVED: Traffic bill over quota: Alert for device cr2-eqsin.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [18:53:54] (03PS1) 10Ryan Kemper: wdqs: set appropriate graph split cluster name [puppet] - 10https://gerrit.wikimedia.org/r/1060158 (https://phabricator.wikimedia.org/T364366) [18:54:23] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:55:04] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1060158 (https://phabricator.wikimedia.org/T364366) (owner: 10Ryan Kemper) [18:55:13] (03PS2) 10Ryan Kemper: wdqs: set appropriate graph split cluster name [puppet] - 10https://gerrit.wikimedia.org/r/1060158 (https://phabricator.wikimedia.org/T364366) [18:55:14] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1060158 (https://phabricator.wikimedia.org/T364366) (owner: 10Ryan Kemper) [18:55:51] (03CR) 10Bernard Wang: [C:03+1] Promote dark mode for anons on various wikis - take 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060152 (https://phabricator.wikimedia.org/T371070) (owner: 10Stoyofuku-wmf) [18:57:09] !log sudo cumin "lvs4008*" 'disable-puppet "rebooting" && systemctl stop pybal.service' [18:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:23] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:03:29] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ml-serve2009 [19:03:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-serve2009 [19:03:45] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ml-serve2010 [19:03:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-serve2010 [19:03:59] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ml-serve2011 [19:04:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-serve2011 [19:05:12] (03CR) 10Ryan Kemper: [C:03+2] wdqs: set appropriate graph split cluster name [puppet] - 10https://gerrit.wikimedia.org/r/1060158 (https://phabricator.wikimedia.org/T364366) (owner: 10Ryan Kemper) [19:13:26] !log sukhe@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs4008.ulsfo.wmnet [19:16:18] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs4008.ulsfo.wmnet [19:19:03] !log restart varnishmtail on cp3070 [19:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:59] !log start pybal on lvs4008 [19:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:16] !log ryankemper@deploy1003 Started deploy [wdqs/wdqs@316bf7f]: deploy to freshly reimaged host [19:35:26] !log ryankemper@deploy1003 Finished deploy [wdqs/wdqs@316bf7f]: deploy to freshly reimaged host (duration: 00m 09s) [19:37:46] jouncebot: next [19:37:46] In 0 hour(s) and 22 minute(s): UTC late backport window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240806T2000) [19:38:23] !log ryankemper@deploy1003 Started deploy [wdqs/wdqs@316bf7f]: deploy to freshly reimaged host [19:38:42] !log ryankemper@deploy1003 Finished deploy [wdqs/wdqs@316bf7f]: deploy to freshly reimaged host (duration: 00m 18s) [19:42:02] !log ryankemper@deploy1003 Started deploy [wdqs/wdqs@316bf7f]: deploy to freshly reimaged host [19:42:19] !log ryankemper@deploy1003 Finished deploy [wdqs/wdqs@316bf7f]: deploy to freshly reimaged host (duration: 00m 16s) [19:44:27] 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, 07Jenkins, 10Release-Engineering-Team (Seen): Upgrade CI Jenkins ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#10046567 (10hashar) I have changed the key of the contint1002 agent (via the [[ https://integration.wikim... [19:44:34] !log ebernhardson@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync [19:44:58] !log ebernhardson@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync [19:49:26] !log ryankemper@deploy1003 Started deploy [wdqs/wdqs@316bf7f]: deploy to freshly reimaged host [19:49:28] !log ryankemper@deploy1003 Finished deploy [wdqs/wdqs@316bf7f]: deploy to freshly reimaged host (duration: 00m 02s) [19:49:49] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20240729/ using stat1009.eqiad.wmnet) [19:55:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:57:03] !log stop pybal on lvs4009 for server reboot [19:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] 
RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to snap out of that daydream and deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240806T2000). [20:00:04] toyofuku, ebernhardson, and MichaelG_WMF: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:08] * MichaelG_WMF is here [20:02:25] \o [20:02:29] Here! Can't deploy for myself (quite) yet so thank you in advance to whoever ends up deploying my code (: [20:02:47] I can deploy [20:02:56] Give me a moment to set up [20:03:00] ty ty [20:03:04] kindrobot: YaY! Thanks! [20:03:18] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1060161 [20:04:13] I'm going to deploy all three of your patches at the same time and then wait for you to all confirm before syncing [20:04:31] kindrobot: my two changes are sadly i18n changes and so they will take some time [20:04:55] also, they will have to pass CI first and will probably be much slower than config changes there as well [20:05:40] I have an interview in an hour, so as long as I'm able to test before then that should be fine [20:06:31] 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, 07Jenkins, 10Release-Engineering-Team (Seen): Upgrade CI Jenkins ssh key to ecdsa - https://phabricator.wikimedia.org/T177826#10046603 (10hashar) [20:06:48] * cjming waves and bows to kindrobot [20:08:10] ack [20:09:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kindrobot@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060152 (https://phabricator.wikimedia.org/T371070) (owner: 10Stoyofuku-wmf) [20:09:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kindrobot@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060123 
(https://phabricator.wikimedia.org/T355267) (owner: 10Ebernhardson) [20:09:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kindrobot@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1060139 (https://phabricator.wikimedia.org/T371775) (owner: 10Michael Große) [20:09:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kindrobot@deploy1003 using scap backport" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1060136 (https://phabricator.wikimedia.org/T371775) (owner: 10Michael Große) [20:10:08] (03Merged) 10jenkins-bot: Promote dark mode for anons on various wikis - take 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060152 (https://phabricator.wikimedia.org/T371070) (owner: 10Stoyofuku-wmf) [20:10:11] (03Merged) 10jenkins-bot: Enable NetworkSession extension for most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1060123 (https://phabricator.wikimedia.org/T355267) (owner: 10Ebernhardson) [20:11:41] awaiting-backport-merges [20:21:49] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs4009.ulsfo.wmnet [20:22:07] Note that based on previous experience, Zuul's estimated runtime of 22 Minutes for the wmf-gate-and-submit pipeline might be very optimistic. The original change took 36 minutes for the PHP7.4 job. [20:23:35] MichaelG_WMF: do the merge pipelines happen sequentially or in parallel? [20:24:29] I'm no deployer myself. The two config changes from ebernhardson and toyofuku are long merged, I think. 
[20:24:46] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs4009.ulsfo.wmnet [20:25:15] Yeah, that's what I'm seeing: [20:25:17] 20:25:03 awaiting-backport-merges: 50% (ok: 2; fail: 0; left: 2) - [20:25:50] I don't think they're deployed to the test wikis until all of them are merged because I decided to do them all together [20:25:58] I do not know what buttons one would need to press to make the two go forward on their own. Can't help there, sorry [20:26:33] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20240729/ using stat1009.eqiad.wmnet) [20:27:21] !log start pybal on lvs4009 [20:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:29] I might've just made the wrong call to try to get them all through at once. Hopefully we wrap before the top of the hour [20:28:39] Y'all didn't know you were getting the "B team" today :P [20:29:39] hahaha [20:29:50] All good from my end - it should be relatively quick to test [20:30:25] Looks like they are running in parallel [20:30:26] If it gets close to the top of the hour I'll find someone to do their best Steph impression here while I go to my interview [20:30:57] toyofuku: didn't know your name was Steph! I'm Stef :) [20:31:23] hellooo [20:31:52] I'm stef any time I order a drink at a coffeeshop and they ask for my name [20:34:11] Ha! Funny, I always end up Steph at Starbucks [20:43:24] Peeking at the logs as they're being written, the changes are now at 81% of the last PHPUnit test set. 
So maybe 5 more minutes [20:45:37] 🤞🤞🤞 [20:46:42] (03PS4) 10Dzahn: miscweb: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1059418 (https://phabricator.wikimedia.org/T370677) [20:47:23] (03Merged) 10jenkins-bot: fix(i18n): adjust broken mentorship eligibility copy [extensions/GrowthExperiments] (wmf/1.43.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1060139 (https://phabricator.wikimedia.org/T371775) (owner: 10Michael Große) [20:47:26] (03Merged) 10jenkins-bot: fix(i18n): adjust broken mentorship eligibility copy [extensions/GrowthExperiments] (wmf/1.43.0-wmf.16) - 10https://gerrit.wikimedia.org/r/1060136 (https://phabricator.wikimedia.org/T371775) (owner: 10Michael Große) [20:47:40] finally! [20:47:48] !log kindrobot@deploy1003 Started scap sync-world: Backport for [[gerrit:1060152|Promote dark mode for anons on various wikis - take 2 (T371070 T371084)]], [[gerrit:1060123|Enable NetworkSession extension for most wikis (T355267)]], [[gerrit:1060139|fix(i18n): adjust broken mentorship eligibility copy (T371775 T370318)]], [[gerrit:1060136|fix(i18n): adjust broken mentorship eligibility copy (T371775 T370318)]] [20:47:58] T371070: Re-evaluate tier 3 wikis that are ready for dark mode - https://phabricator.wikimedia.org/T371070 [20:47:58] T371084: Deploy dark mode to eswiki and other "ready" wikis - https://phabricator.wikimedia.org/T371084 [20:47:58] T355267: Add extension NetworkSession to all wmf wikis - https://phabricator.wikimedia.org/T355267 [20:47:59] T371775: Permission error when attempting to enroll as mentor on Turkish Wikipedia - https://phabricator.wikimedia.org/T371775 [20:47:59] T370318: GEMentorshipAutomaticEligibility description is wrong - https://phabricator.wikimedia.org/T370318 [20:48:37] (03CR) 10Dzahn: [C:04-1] ":( " The value '10.64.48.113' cannot be converted to Numeric. 
"" [puppet] - 10https://gerrit.wikimedia.org/r/1059418 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [20:51:08] We're almost synced to the test servers [20:54:23] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:56:03] Hmm, some slowness while building the container images. toyofuku you might want to appoint an alternate ;) [20:56:14] Sounds good [20:56:50] !log UTC late backport window, deploy is extending beyond deployment window [20:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:08] (03PS5) 10Dzahn: miscweb: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1059418 (https://phabricator.wikimedia.org/T370677) [20:59:21] (03PS6) 10Dzahn: miscweb: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1059418 (https://phabricator.wikimedia.org/T370677) [20:59:30] If someone's available to fill for me they'll jump in shortly - if not, feel free to keep the deploy going and I'll test very thoroughly after I get out in an hour [20:59:34] !log stop pybal on lvs6002 for server reboot [20:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:44] Sorry for the trouble [21:00:34] That slowness might be the fault of my changes. Last time there was the speculation that rebuilding the localisation caches takes time 😔 [21:00:44] Also sorry for the trouble :/ [21:01:05] toyofuku: do you foresee any risks if you don't test it before sync? 
[21:01:50] rebuilding localization always takes a long time, don't need to speculate :) [21:02:37] I'm very sorry team, but if toyofuku's alternate doesn't show, I'll likely cancel the deploy [21:02:42] (03CR) 10Andrew Bogott: "Tested + works" [puppet] - 10https://gerrit.wikimedia.org/r/1059958 (https://phabricator.wikimedia.org/T371707) (owner: 10Andrew Bogott) [21:02:55] But we still haven't synced yet [21:03:32] (03PS7) 10Dzahn: miscweb: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1059418 (https://phabricator.wikimedia.org/T370677) [21:04:11] (03CR) 10Andrew Bogott: "This only sort of works. The designate callback is in a race with VM deletion; sometimes the VM is still in state Active when we do the cl" [puppet] - 10https://gerrit.wikimedia.org/r/1059959 (https://phabricator.wikimedia.org/T371707) (owner: 10Andrew Bogott) [21:05:28] (03PS8) 10Dzahn: miscweb: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1059418 (https://phabricator.wikimedia.org/T370677) [21:14:30] (03PS11) 10Andrew Bogott: wmf_sink: replace targeted proxy cleanup with project-wide cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1059959 (https://phabricator.wikimedia.org/T371707) [21:14:31] (03PS1) 10Andrew Bogott: wmfsink: hook delete.end rather than delete.start [puppet] - 10https://gerrit.wikimedia.org/r/1060172 (https://phabricator.wikimedia.org/T371707) [21:15:17] how long can that cache take to rebuild [21:15:38] 21:14:18 scap-cdb-rebuild: 75% (in-flight: 1; ok: 3; fail: 0; left: 0) [21:16:09] !log brett@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs6002.drmrs.wmnet [21:17:07] Thanks for the update 🙏 [21:17:38] I've thought about streaming my terminal while deploying, but I'm not sure if there are security implications [21:18:27] I would watch that, but I can see how it might be sensitive [21:18:58] maybe I'll ask th.cipriani next time I see him [21:18:59] !log brett@cumin2002 END (PASS) - 
Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs6002.drmrs.wmnet [21:20:14] MichaelG_WMF: just curious, why backport a i18n change instead of letting it go with the train? [21:20:28] kindrobot: I can fill in for toyofuku if the deploy is still happening [21:20:39] jan_drewniak: it is and thank you [21:21:24] !log start pybal on lvs6002 [21:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:27] !log kindrobot@deploy1003 toyofuku, ebernhardson, kindrobot, migr: Backport for [[gerrit:1060152|Promote dark mode for anons on various wikis - take 2 (T371070 T371084)]], [[gerrit:1060123|Enable NetworkSession extension for most wikis (T355267)]], [[gerrit:1060139|fix(i18n): adjust broken mentorship eligibility copy (T371775 T370318)]], [[gerrit:1060136|fix(i18n): adjust broken mentorship eligibility copy (T371775 T37031 [21:21:27] 8)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:21:35] T371070: Re-evaluate tier 3 wikis that are ready for dark mode - https://phabricator.wikimedia.org/T371070 [21:21:36] T371084: Deploy dark mode to eswiki and other "ready" wikis - https://phabricator.wikimedia.org/T371084 [21:21:36] T355267: Add extension NetworkSession to all wmf wikis - https://phabricator.wikimedia.org/T355267 [21:21:36] T371775: Permission error when attempting to enroll as mentor on Turkish Wikipedia - https://phabricator.wikimedia.org/T371775 [21:21:37] T370318: GEMentorshipAutomaticEligibility description is wrong - https://phabricator.wikimedia.org/T370318 [21:21:37] T37031: Change decimal separators for numbers in Kurdish - https://phabricator.wikimedia.org/T37031 [21:21:42] kindrobot: a wrong message on a control for CommunityConfiguration. 
It makes it sound like the control is doing something that it doesn't (enroll mentors automatically) [21:22:31] ah, ok [21:22:33] that is already causing problems, because disabling that checkbox means that most editors-wanting-to-be-mentors can't enroll themselves anymore [21:22:47] wish I had learned of that earlier. Like 2 months ago :/ [21:23:10] MichaelG_WMF, jan_drewniak, ebernhardson: please confirm changes [21:23:17] looking! [21:23:54] (03CR) 10Andrew Bogott: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1060172 fixes the race by waiting until the end of deletion to cleanup." [puppet] - 10https://gerrit.wikimedia.org/r/1059959 (https://phabricator.wikimedia.org/T371707) (owner: 10Andrew Bogott) [21:24:15] kindrobot: awesome, good to sync here [21:24:37] kindrobot: mine looks good [21:24:48] ty ty [21:25:07] mine look good ! [21:25:36] Great, syncing [21:25:39] !log kindrobot@deploy1003 toyofuku, ebernhardson, kindrobot, migr: Continuing with sync [21:34:53] !log kindrobot@deploy1003 Finished scap: Backport for [[gerrit:1060152|Promote dark mode for anons on various wikis - take 2 (T371070 T371084)]], [[gerrit:1060123|Enable NetworkSession extension for most wikis (T355267)]], [[gerrit:1060139|fix(i18n): adjust broken mentorship eligibility copy (T371775 T370318)]], [[gerrit:1060136|fix(i18n): adjust broken mentorship eligibility copy (T371775 T370318)]] (duration: 47m 05s) [21:35:00] T371070: Re-evaluate tier 3 wikis that are ready for dark mode - https://phabricator.wikimedia.org/T371070 [21:35:00] T371084: Deploy dark mode to eswiki and other "ready" wikis - https://phabricator.wikimedia.org/T371084 [21:35:01] T355267: Add extension NetworkSession to all wmf wikis - https://phabricator.wikimedia.org/T355267 [21:35:01] T371775: Permission error when attempting to enroll as mentor on Turkish Wikipedia - https://phabricator.wikimedia.org/T371775 [21:35:02] T370318: GEMentorshipAutomaticEligibility description is wrong - 
https://phabricator.wikimedia.org/T370318 [21:35:17] !log UTC late backport window finished <3 [21:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:32] Yay, thank you so much kindrobot ! [21:35:45] np, thank you [21:59:23] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:32:17] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [22:34:38] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:38:34] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [22:41:38] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wikikube-worker1271 - jclark@cumin1002" [22:41:42] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wikikube-worker1271 - jclark@cumin1002" [22:41:42] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:42:07] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1271.mgmt.eqiad.wmnet with reboot policy FORCED [22:45:40] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1270.eqiad.wmnet with OS bullseye [22:45:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10046981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1270.eqiad.wmnet with OS bull... 
[22:46:12] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1272.eqiad.wmnet with OS bullseye [22:46:21] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10046982 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1272.eqiad.wmnet with OS bull... [22:46:25] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1273.eqiad.wmnet with OS bullseye [22:46:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10046983 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1273.eqiad.wmnet with OS bull... [22:47:10] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1274.eqiad.wmnet with OS bullseye [22:47:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10046984 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1274.eqiad.wmnet with OS bull... [22:47:28] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1275.eqiad.wmnet with OS bullseye [22:47:34] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1276.eqiad.wmnet with OS bullseye [22:47:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10046985 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1275.eqiad.wmnet with OS bull... 
[22:47:39] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10046986 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1276.eqiad.wmnet with OS bull... [22:47:44] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1277.eqiad.wmnet with OS bullseye [22:47:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10046987 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1277.eqiad.wmnet with OS bull... [22:48:00] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1278.eqiad.wmnet with OS bullseye [22:48:06] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10046988 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1278.eqiad.wmnet with OS bull... 
[22:54:23] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:59:23] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:01:48] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1271.mgmt.eqiad.wmnet with reboot policy FORCED [23:02:10] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1270.eqiad.wmnet with reason: host reimage [23:02:20] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1271.eqiad.wmnet with OS bullseye [23:02:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10046992 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1271.eqiad.wmnet with OS bull... 
[23:02:55] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1272.eqiad.wmnet with reason: host reimage [23:03:09] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1273.eqiad.wmnet with reason: host reimage [23:03:37] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1274.eqiad.wmnet with reason: host reimage [23:04:13] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1275.eqiad.wmnet with reason: host reimage [23:04:23] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1276.eqiad.wmnet with reason: host reimage [23:04:48] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1277.eqiad.wmnet with reason: host reimage [23:04:56] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1278.eqiad.wmnet with reason: host reimage [23:05:39] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1270.eqiad.wmnet with reason: host reimage [23:06:26] (03PS1) 10Ahmon Dancy: php7.4-fpm-multiversion-base: Fix a couple of typos [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1060182 [23:08:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1275.eqiad.wmnet with reason: host reimage [23:10:48] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1278.eqiad.wmnet with reason: host reimage [23:13:43] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1273.eqiad.wmnet with reason: host reimage [23:17:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1277.eqiad.wmnet with reason: host reimage [23:18:50] !log jclark@cumin1002 
START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1271.eqiad.wmnet with reason: host reimage [23:20:25] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1276.eqiad.wmnet with reason: host reimage [23:22:15] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:22:31] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [23:22:32] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1270.eqiad.wmnet with OS bullseye [23:22:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047045 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1270.eqiad.wmnet with OS bullseye... 
[23:24:04] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1272.eqiad.wmnet with reason: host reimage
[23:24:52] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:27:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:27:52] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1275.eqiad.wmnet with OS bullseye
[23:27:53] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1271.eqiad.wmnet with reason: host reimage
[23:28:01] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047046 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1275.eqiad.wmnet with OS bullseye...
[23:28:09] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:32:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1274.eqiad.wmnet with reason: host reimage
[23:33:13] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:33:33] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047050 (Jclark-ctr)
[23:33:37] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:33:38] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1278.eqiad.wmnet with OS bullseye
[23:33:43] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047051 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1278.eqiad.wmnet with OS bullseye...
[23:33:43] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:33:44] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1277.eqiad.wmnet with OS bullseye
[23:33:51] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047052 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1277.eqiad.wmnet with OS bullseye...
[23:34:26] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:34:30] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:34:31] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1273.eqiad.wmnet with OS bullseye
[23:34:37] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047053 (Jclark-ctr)
[23:34:40] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047054 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1273.eqiad.wmnet with OS bullseye...
[23:37:43] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:38:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:38:16] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1276.eqiad.wmnet with OS bullseye
[23:38:28] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047055 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1276.eqiad.wmnet with OS bullseye...
[23:38:37] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1060183
[23:38:37] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1060183 (owner: TrainBranchBot)
[23:38:41] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047056 (Jclark-ctr)
[23:39:55] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1279.eqiad.wmnet with OS bullseye
[23:40:00] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047058 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1279.eqiad.wmnet with OS bull...
[23:40:10] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1280.eqiad.wmnet with OS bullseye
[23:40:16] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047059 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1280.eqiad.wmnet with OS bull...
[23:40:22] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1281.eqiad.wmnet with OS bullseye
[23:40:29] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047060 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1281.eqiad.wmnet with OS bull...
[23:40:31] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1282.eqiad.wmnet with OS bullseye
[23:40:37] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047061 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1282.eqiad.wmnet with OS bull...
[23:40:51] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1284.eqiad.wmnet with OS bullseye
[23:40:59] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047062 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1284.eqiad.wmnet with OS bull...
[23:41:09] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1283.eqiad.wmnet with OS bullseye
[23:41:15] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047063 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1283.eqiad.wmnet with OS bull...
[23:41:17] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:43:25] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:43:25] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1272.eqiad.wmnet with OS bullseye
[23:43:31] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047064 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1272.eqiad.wmnet with OS bullseye...
[23:43:53] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047065 (Jclark-ctr)
[23:44:22] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:46:06] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:46:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1271.eqiad.wmnet with OS bullseye
[23:46:14] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047069 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1271.eqiad.wmnet with OS bullseye...
[23:46:30] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047070 (Jclark-ctr)
[23:47:19] (PS1) RLazarus: mediawiki: Bump ttlSecondsAfterFinished for Jobs [deployment-charts] - https://gerrit.wikimedia.org/r/1060184
[23:49:28] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1285.eqiad.wmnet with OS bullseye
[23:49:31] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:49:36] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1286.eqiad.wmnet with OS bullseye
[23:49:40] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047082 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1285.eqiad.wmnet with OS bull...
[23:49:46] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047083 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1286.eqiad.wmnet with OS bull...
[23:49:52] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[23:49:53] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1274.eqiad.wmnet with OS bullseye
[23:49:59] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047084 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1274.eqiad.wmnet with OS bullseye...
[23:50:08] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10047085 (Jclark-ctr)
[23:56:50] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1279.eqiad.wmnet with reason: host reimage
[23:57:05] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1280.eqiad.wmnet with reason: host reimage
[23:57:17] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1282.eqiad.wmnet with reason: host reimage
[23:57:19] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1281.eqiad.wmnet with reason: host reimage
[23:57:42] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1283.eqiad.wmnet with reason: host reimage
[23:57:45] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1284.eqiad.wmnet with reason: host reimage