[00:04:56] FIRING: SystemdUnitFailed: rsyslog-imfile-remedy.service on wikikube-worker2094:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:07:10] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1075334 (owner: TrainBranchBot)
[00:56:27] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[01:12:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 810.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:13:43] (CR) Andrea Denisse: [C:+1] "LGTM, thank you!"
[cookbooks] - https://gerrit.wikimedia.org/r/1075314 (https://phabricator.wikimedia.org/T375544) (owner: Scott French)
[01:17:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 810.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:01:03] (PS1) RLazarus: deployment_server: More mwscript-k8s usability tweaks [puppet] - https://gerrit.wikimedia.org/r/1075346
[02:13:55] FIRING: SystemdUnitFailed: upload_puppet_facts.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:38:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:43] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:04:56] RESOLVED: SystemdUnitFailed: rsyslog-imfile-remedy.service on wikikube-worker2094:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:54:56] FIRING: SystemdUnitFailed: build-homepage.service on registry2005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:57:32] SRE, Data-Engineering, Observability-Logging, Wikimedia-Logstash, Event-Platform: Integrate Event Platform and ECS logs - https://phabricator.wikimedia.org/T291645#10174094 (Cpetrillo) This would be very useful for us to be able to understand if known problematic reusers (see: https://phabric...
[03:59:24] (CR) KartikMistry: [C:+2] Update MinT to 2024-09-19-120927-production [deployment-charts] - https://gerrit.wikimedia.org/r/1074187 (owner: KartikMistry)
[03:59:38] ^ Deploying MinT
[04:00:27] (Merged) jenkins-bot: Update MinT to 2024-09-19-120927-production [deployment-charts] - https://gerrit.wikimedia.org/r/1074187 (owner: KartikMistry)
[04:01:25] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[04:04:56] FIRING: [2x] SystemdUnitFailed: build-homepage.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:05:57] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[04:09:56] FIRING: [2x] SystemdUnitFailed: build-homepage.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:12:05] !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[04:14:56] RESOLVED: [2x] SystemdUnitFailed: build-homepage.service on registry2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:20:24] !log kartik@deploy1003
helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[04:23:45] !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[04:32:52] !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[04:38:48] ૅ!log Updated MinT to 2024-09-19-120927-production
[04:39:14] !log Updated MinT to 2024-09-19-120927-production
[04:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:56:27] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[05:32:26] ops-eqiad, SRE, DBA, DC-Ops, and 2 others: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10174130 (ABran-WMF) a:VRiley-WMF→ABran-WMF great @Jclark-ctr thanks! will take it over from here :-)
[05:38:19] SRE, DBA: Parsercache primary master databases should monitor replication - https://phabricator.wikimedia.org/T375395#10174139 (ABran-WMF) Open→Resolved a:ABran-WMF @jcrespo I'll close this ticket as described in T373037#10174135, both task have been tied together.
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240925T0600)
[06:05:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter2006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:10:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter2006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:13:55] FIRING: SystemdUnitFailed: upload_puppet_facts.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:36:47] ops-eqiad, SRE, DBA, DC-Ops, and 2 others: Post pc1013 crash - https://phabricator.wikimedia.org/T375382#10174150 (ABran-WMF) p:High→Medium
[06:41:54] (PS1) Muehlenhoff: Remove LDAP access for rudolphampofo [puppet] - https://gerrit.wikimedia.org/r/1075356
[06:48:01] (PS1) Slyngshede: Minor UI tweaks. [software/bitu] - https://gerrit.wikimedia.org/r/1075432
[06:50:49] (PS1) Slyngshede: P:idm Gitlab API is https.
[puppet] - https://gerrit.wikimedia.org/r/1075433
[06:51:40] !log installing gnutls security updates on bullseye/bookworm
[06:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:59] (CR) Alexandros Kosiaris: [C:+1] sre.discovery.datacenter: exclude kibana7 [cookbooks] - https://gerrit.wikimedia.org/r/1075314 (https://phabricator.wikimedia.org/T375544) (owner: Scott French)
[06:53:48] (PS2) Slyngshede: Minor UI tweaks, fix Gerrit blocking bug. [software/bitu] - https://gerrit.wikimedia.org/r/1075432
[06:54:27] (CR) Slyngshede: [C:+2] P:idm Gitlab API is https. [puppet] - https://gerrit.wikimedia.org/r/1075433 (owner: Slyngshede)
[07:00:05] Amir1 and Urbanecm: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240925T0700). Please do the needful.
[07:00:06] abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:00] I can deploy abijeet's patch.
[07:01:11] hello
[07:01:35] abijeet: hola. I'll ping once patch is to test on mwdebug servers.
[07:01:39] SRE-tools, Infrastructure-Foundations: SREBatchBase: Print allowed aliases in help message - https://phabricator.wikimedia.org/T375590 (MoritzMuehlenhoff) NEW
[07:01:42] (CR) Brouberol: [C:+1] "LG!" [puppet] - https://gerrit.wikimedia.org/r/1075321 (https://phabricator.wikimedia.org/T374948) (owner: Bking)
[07:01:46] SRE-tools, Infrastructure-Foundations: SREBatchBase: Print allowed aliases in help message - https://phabricator.wikimedia.org/T375590#10174190 (MoritzMuehlenhoff) p:Triage→Medium
[07:01:47] kart_, thanks!
[07:02:19] (CR) Brouberol: [C:+1] "Perfect" [deployment-charts] - https://gerrit.wikimedia.org/r/1075278 (https://phabricator.wikimedia.org/T374948) (owner: Bking)
[07:02:23] !log jmm@cumin2002 START - Cookbook sre.o11y.roll-restart-reboot-logstash-collectors rolling restart_daemons on A:logstash-collector
[07:02:51] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1072166 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[07:03:37] (Merged) jenkins-bot: Enable message group subscription feature for Test Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1072166 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[07:04:02] (CR) Muehlenhoff: [C:+1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/1075432 (owner: Slyngshede)
[07:04:20] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1072166|Enable message group subscription feature for Test Wikipedia (T372386)]]
[07:04:25] (CR) Slyngshede: [C:+2] Minor UI tweaks, fix Gerrit blocking bug. [software/bitu] - https://gerrit.wikimedia.org/r/1075432 (owner: Slyngshede)
[07:04:27] T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386
[07:06:35] !log kartik@deploy1003 abi, kartik: Backport for [[gerrit:1072166|Enable message group subscription feature for Test Wikipedia (T372386)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:07:27] (CR) Jelto: [C:+1] "lgtm. When this behaves as expected in WMCS we can also remove this config for the production gitlab-runners" [puppet] - https://gerrit.wikimedia.org/r/1075242 (https://phabricator.wikimedia.org/T375488) (owner: JMeybohm)
[07:08:57] abijeet: available on mwdebug for testing.
[07:09:12] kart_, checking
[07:11:35] (CR) Jelto: [C:+2] Revert "prometheus::node_exporter: remove /var/lib/docker from ignored_mount_points" [puppet] - https://gerrit.wikimedia.org/r/1075242 (https://phabricator.wikimedia.org/T375488) (owner: JMeybohm)
[07:17:45] abijeet: all OK?
[07:18:46] kart_, not seeing the button to subscribe appear.
[07:19:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.o11y.roll-restart-reboot-logstash-collectors (exit_code=0) rolling restart_daemons on A:logstash-collector
[07:19:57] oh!
[07:24:01] (PS1) Slyngshede: Undo async setting for account blocking. [software/bitu] - https://gerrit.wikimedia.org/r/1075499
[07:25:07] abijeet: I see error: messagegroupsubscription: Failed to fetch user subscriptions internal_api_error_DBQueryError {action: 'query', list: 'messagegroupsubscription', formatversion: 2}
[07:26:55] kart_, the configuration change seems to have been deployed fine. I'm doing some debugging...will ping you in a bit
[07:27:12] mw.config.get( 'wgTranslateEnableMessageGroupSubscription' ) returns true
[07:27:12] (CR) Slyngshede: [C:+2] Undo async setting for account blocking. [software/bitu] - https://gerrit.wikimedia.org/r/1075499 (owner: Slyngshede)
[07:27:44] Yes
[07:29:56] (Merged) jenkins-bot: Undo async setting for account blocking. [software/bitu] - https://gerrit.wikimedia.org/r/1075499 (owner: Slyngshede)
[07:31:21] kart_, I see a database query error in the console. Let's revert the change
[07:32:22] OK!
[07:32:52] !log kartik@deploy1003 Sync cancelled.
[07:32:59] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling restart_daemons on A:ldap-replicas-codfw
[07:33:39] (CR) Muehlenhoff: [C:+2] Remove LDAP access for rudolphampofo [puppet] - https://gerrit.wikimedia.org/r/1075356 (owner: Muehlenhoff)
[07:33:46] (PS1) TrainBranchBot: Revert "Enable message group subscription feature for Test Wikipedia" [mediawiki-config] - https://gerrit.wikimedia.org/r/1075501
[07:33:46] (CR) TrainBranchBot: "kartik@deploy1003 created a revert of this change as I6f78b7a102ae9f6507e54866b7824fa82eafad5b" [mediawiki-config] - https://gerrit.wikimedia.org/r/1072166 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[07:34:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling restart_daemons on A:ldap-replicas-codfw
[07:34:16] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1075501 (owner: TrainBranchBot)
[07:34:32] kart_, thanks!
[07:35:10] (Merged) jenkins-bot: Revert "Enable message group subscription feature for Test Wikipedia" [mediawiki-config] - https://gerrit.wikimedia.org/r/1075501 (owner: TrainBranchBot)
[07:35:29] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1075501|Revert "Enable message group subscription feature for Test Wikipedia"]]
[07:36:40] (CR) Vgutierrez: [C:-1] "this should be split by service and the CDN shouldn't be a part of it since we are doing a progressive deprecation there."
[puppet] - https://gerrit.wikimedia.org/r/1075326 (https://phabricator.wikimedia.org/T375569) (owner: BCornwall)
[07:36:44] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling restart_daemons on A:ldap-replicas-eqiad
[07:37:26] !log kartik@deploy1003 kartik, trainbranchbot: Backport for [[gerrit:1075501|Revert "Enable message group subscription feature for Test Wikipedia"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:37:36] !log kartik@deploy1003 kartik, trainbranchbot: Continuing with sync
[07:37:40] !log restarting slapd on r/w LDAP servers to pick up GNUTLS security updates
[07:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling restart_daemons on A:ldap-replicas-eqiad
[07:39:01] Interesting, if we deploy revert changes only deployed to mwdebug servers with scap backport --revert, do we need to do full deployment?
:)
[07:40:43] SRE, Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10174244 (MoritzMuehlenhoff)
[07:42:18] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1075501|Revert "Enable message group subscription feature for Test Wikipedia"]] (duration: 06m 48s)
[07:43:24] (PS1) Elukey: docker_registry_ha: configure the max entries returned for catalog reqs [puppet] - https://gerrit.wikimedia.org/r/1075502
[07:44:27] (CR) Ayounsi: [C:+2] Allow prometheus hosts to reach gnmi port [homer/public] - https://gerrit.wikimedia.org/r/1075237 (owner: Ayounsi)
[07:44:32] (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4116/co" [puppet] - https://gerrit.wikimedia.org/r/1075502 (owner: Elukey)
[07:44:59] (Merged) jenkins-bot: Allow prometheus hosts to reach gnmi port [homer/public] - https://gerrit.wikimedia.org/r/1075237 (owner: Ayounsi)
[07:45:02] (CR) CI reject: [V:-1] docker_registry_ha: configure the max entries returned for catalog reqs [puppet] - https://gerrit.wikimedia.org/r/1075502 (owner: Elukey)
[07:45:04] (PS1) Slyngshede: Bump Debian package version to 0.0.9.
[software/bitu] - https://gerrit.wikimedia.org/r/1075503
[07:45:05] (CR) DCausse: [C:+2] rdf-streaming-updater: use SSL and external-services fqdn to access kafka-main [deployment-charts] - https://gerrit.wikimedia.org/r/1072231 (https://phabricator.wikimedia.org/T333373) (owner: DCausse)
[07:46:06] (PS2) Elukey: docker_registry_ha: configure the max entries returned for catalog reqs [puppet] - https://gerrit.wikimedia.org/r/1075502
[07:46:08] (Merged) jenkins-bot: rdf-streaming-updater: use SSL and external-services fqdn to access kafka-main [deployment-charts] - https://gerrit.wikimedia.org/r/1072231 (https://phabricator.wikimedia.org/T333373) (owner: DCausse)
[07:47:23] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[07:47:57] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[07:48:06] (CR) Hashar: [C:+1] "I have a simple change I can deploy to validate everything works fine: https://gerrit.wikimedia.org/r/c/integration/docroot/+/1071197" [puppet] - https://gerrit.wikimedia.org/r/1072744 (owner: Muehlenhoff)
[07:51:21] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply
[07:51:50] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply
[07:52:03] (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1075503 (owner: Slyngshede)
[07:52:18] (PS1) Jelto: gitlab: test defs_from_etcd on the replica [puppet] - https://gerrit.wikimedia.org/r/1075504 (https://phabricator.wikimedia.org/T366882)
[07:54:20] (CR) Muehlenhoff: [C:+2] deployment servers: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1072744 (owner: Muehlenhoff)
[07:54:41] (CR) Slyngshede: [C:+2] Bump Debian package version to 0.0.9.
[software/bitu] - https://gerrit.wikimedia.org/r/1075503 (owner: Slyngshede)
[07:54:54] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1075504 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[07:55:40] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply
[07:55:48] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply
[07:57:04] (Merged) jenkins-bot: Bump Debian package version to 0.0.9. [software/bitu] - https://gerrit.wikimedia.org/r/1075503 (owner: Slyngshede)
[07:58:10] !log running REPLACE into dtpwiki db2123 (s5) T375507
[07:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:53] (CR) JMeybohm: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1075502 (owner: Elukey)
[07:59:56] (CR) Elukey: "Tested live on registry1005, and it works:" [puppet] - https://gerrit.wikimedia.org/r/1075502 (owner: Elukey)
[08:00:04] (CR) Elukey: [C:+2] docker_registry_ha: configure the max entries returned for catalog reqs [puppet] - https://gerrit.wikimedia.org/r/1075502 (owner: Elukey)
[08:00:05] brennen and jnuche: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240925T0800).
[08:04:56] FIRING: SystemdUnitFailed: build-homepage.service on registry1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:06:19] !log hashar@deploy1003 Started deploy [integration/docroot@0482d53]: zuul: show change queues window value
[08:06:26] !log hashar@deploy1003 Finished deploy [integration/docroot@0482d53]: zuul: show change queues window value (duration: 00m 07s)
[08:06:31] !log set max-catalog-entries (changes the default catalog pagination) to 50 for docker-registry - T348876
[08:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:37] T348876: Container image reports in debmonitor are broken - https://phabricator.wikimedia.org/T348876
[08:07:14] SRE, Infrastructure-Foundations: Integrate Bullseye 11.11 point update - https://phabricator.wikimedia.org/T373795#10174363 (MoritzMuehlenhoff)
[08:07:24] SRE, Infrastructure-Foundations: Integrate Bullseye 11.10 point update - https://phabricator.wikimedia.org/T368288#10174367 (MoritzMuehlenhoff)
[08:09:23] (PS1) Muehlenhoff: scap_proxy: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1075507
[08:09:51] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1075507 (owner: Muehlenhoff)
[08:12:33] (PS1) Brouberol: Deploy an airflow-scheduler ClusteRole to dse-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389)
[08:14:52] (PS2) Santiago Faci: MPIC: Deploying on staging a new relase v0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1075207 (https://phabricator.wikimedia.org/T373473)
[08:14:56] FIRING: [2x] SystemdUnitFailed: build-homepage.service on registry1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:15:17] (PS3) Santiago Faci: MPIC: Deploying on staging a new relase v0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1075207 (https://phabricator.wikimedia.org/T373473)
[08:27:34] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging DNdubane out of all services on: 1540 hosts
[08:28:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging DNdubane out of all services on: 1540 hosts
[08:28:09] (CR) Hashar: [C:+1] scap_proxy: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1075507 (owner: Muehlenhoff)
[08:28:12] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging DNdubane out of all services on: 700 hosts
[08:28:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging DNdubane out of all services on: 700 hosts
[08:29:32] (PS1) Muehlenhoff: Remove access for dumisani [puppet] - https://gerrit.wikimedia.org/r/1075509
[08:32:15] (CR) Slyngshede: [C:+1] "Looks good. Needs to be removed from wmf ldap group as well."
[puppet] - https://gerrit.wikimedia.org/r/1075509 (owner: Muehlenhoff)
[08:33:49] (CR) Muehlenhoff: [C:+2] Remove access for dumisani [puppet] - https://gerrit.wikimedia.org/r/1075509 (owner: Muehlenhoff)
[08:39:05] (PS1) Elukey: docker_registry_ha: reduce maxentries' default to 25 [puppet] - https://gerrit.wikimedia.org/r/1075510 (https://phabricator.wikimedia.org/T348876)
[08:42:42] (CR) FNegri: [C:+1] [WikiReplicas] Hide autoblock targets in the globalblocks table [puppet] - https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: Dreamy Jazz)
[08:47:59] (Abandoned) Muehlenhoff: Revert "No longer include config-master on Puppet 5 frontends" [puppet] - https://gerrit.wikimedia.org/r/1074994 (owner: Muehlenhoff)
[08:48:58] (Abandoned) Muehlenhoff: P:etcd::tlsproxy: move to cfssl pki [puppet] - https://gerrit.wikimedia.org/r/790657 (https://phabricator.wikimedia.org/T307383) (owner: Jbond)
[08:49:56] FIRING: [2x] SystemdUnitFailed: build-homepage.service on registry1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:50:03] (CR) Jcrespo: [C:+1] "Looks ok sql-wise, but probably someone more familiar with mediawiki (security or engineering) should give an ok to the final view once de" [puppet] - https://gerrit.wikimedia.org/r/1073430 (https://phabricator.wikimedia.org/T371486) (owner: Dreamy Jazz)
[08:50:54] (PS2) Brouberol: Deploy an airflow-scheduler SA/Role/Rolebinding to dse-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389)
[08:52:13] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:53:32] (CR) Arnaudb: [C:+1] hiera: specify cluster for apus nodes [puppet] - https://gerrit.wikimedia.org/r/1075027 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[08:54:56] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:54:56] (CR) CI reject: [V:-1] Deploy an airflow-scheduler SA/Role/Rolebinding to dse-k8s-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389) (owner: Brouberol)
[08:55:44] (CR) Btullis: Deploy an airflow-scheduler SA/Role/Rolebinding to dse-k8s-eqiad (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389) (owner: Brouberol)
[08:56:27] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2002:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections
[08:59:09] (CR) Brouberol: Deploy an airflow-scheduler SA/Role/Rolebinding to dse-k8s-eqiad (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389) (owner: Brouberol)
[09:01:03] (CR) Elukey: [C:+2] docker_registry_ha: reduce maxentries' default to 25 [puppet] - https://gerrit.wikimedia.org/r/1075510 (https://phabricator.wikimedia.org/T348876) (owner: Elukey)
[09:01:56] (CR) MVernon: [C:+2] hiera: specify cluster for apus nodes [puppet] - https://gerrit.wikimedia.org/r/1075027 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[09:03:03] (PS1) Slyngshede: P:idm Add empty
ACCESS_REQUEST_RULES to production. [puppet] - https://gerrit.wikimedia.org/r/1075512
[09:06:43] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1075512 (owner: Slyngshede)
[09:06:58] (CR) Slyngshede: [C:+2] P:idm Add empty ACCESS_REQUEST_RULES to production. [puppet] - https://gerrit.wikimedia.org/r/1075512 (owner: Slyngshede)
[09:07:24] !log mvernon@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=apus,name=codfw
[09:09:56] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:10:23] (PS1) Slyngshede: IDM: Switch to upgraded IDM host. [dns] - https://gerrit.wikimedia.org/r/1075513
[09:11:51] !log set max-catalog-entries (changes the default catalog pagination) to 25 for docker-registry - T348876
[09:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:58] T348876: Container image reports in debmonitor are broken - https://phabricator.wikimedia.org/T348876
[09:12:49] (CR) Muehlenhoff: [C:+1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/1075513 (owner: Slyngshede)
[09:13:43] (CR) Slyngshede: [C:+2] IDM: Switch to upgraded IDM host.
[dns] - https://gerrit.wikimedia.org/r/1075513 (owner: Slyngshede)
[09:17:54] (CR) Muehlenhoff: [C:+2] puppetdb: Move JVM config out of the puppetmaster module [puppet] - https://gerrit.wikimedia.org/r/1074948 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[09:19:22] !log dcaro@cumin1002 START - Cookbook sre.hosts.downtime for 180 days, 0:00:00 on cloudcephosd1025.eqiad.wmnet with reason: Getting the disks shipped to dell T348643
[09:19:28] T348643: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643
[09:19:36] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 180 days, 0:00:00 on cloudcephosd1025.eqiad.wmnet with reason: Getting the disks shipped to dell T348643
[09:19:52] ops-eqiad, SRE, Cloud-VPS, DC-Ops, cloud-services-team (FY2024/2025-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#10174599 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=087b480f-3f34-4877-a07a-3baa2b98f863) s...
[09:20:14] SRE-tools, Infrastructure-Foundations: SREBatchBase: Print allowed aliases in help message - https://phabricator.wikimedia.org/T375590#10174602 (Volans) With the current API that's not possible because `allowed_aliases` is an instance property (not a class property) of the runner class, not the cookbook...
[09:22:54] !log Upgrade idm2001 to Bitu version 0.0.9
[09:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:26] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1038.eqiad.wmnet
[09:26:21] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:30:12] (PS1) Muehlenhoff: Switch cloudcephosd1038 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1075514 (https://phabricator.wikimedia.org/T349619)
[09:30:46] (CR) Muehlenhoff: [C:+2] Switch cloudcephosd1038 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1075514 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[09:34:39] (Abandoned) Muehlenhoff: Stop including profile::configmaster in puppetmaster frontends [puppet] - https://gerrit.wikimedia.org/r/1007363 (https://phabricator.wikimedia.org/T341717) (owner: Muehlenhoff)
[09:35:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1038.eqiad.wmnet
[09:36:31] (PS1) Filippo Giunchedi: vopsbot: remove systemd::service alert, replaced by alertmanager [puppet] - https://gerrit.wikimedia.org/r/1075515 (https://phabricator.wikimedia.org/T321808)
[09:36:32] (PS1) Filippo Giunchedi: icinga: replace url checks with pingthing [puppet] - https://gerrit.wikimedia.org/r/1075516 (https://phabricator.wikimedia.org/T321808)
[09:36:49] (CR) CI reject: [V:-1] vopsbot: remove systemd::service alert, replaced by alertmanager [puppet] - https://gerrit.wikimedia.org/r/1075515 (https://phabricator.wikimedia.org/T321808) (owner: Filippo Giunchedi)
[09:39:33] !log installing distro-info-data updates from bullseye/bookworm point updates
[09:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:04] (PS2) Filippo Giunchedi: vopsbot: remove
systemd::service alert, replaced by alertmanager [puppet] - https://gerrit.wikimedia.org/r/1075515 (https://phabricator.wikimedia.org/T321808) [09:40:05] (PS2) Filippo Giunchedi: icinga: replace url checks with pingthing [puppet] - https://gerrit.wikimedia.org/r/1075516 (https://phabricator.wikimedia.org/T321808) [09:43:18] SRE, Infrastructure-Foundations: Integrate Bookworm 12.6 point update - https://phabricator.wikimedia.org/T374536#10174729 (MoritzMuehlenhoff) [09:45:32] SRE, Infrastructure-Foundations: Integrate Bullseye 11.10 point update - https://phabricator.wikimedia.org/T368288#10174735 (MoritzMuehlenhoff) [09:48:39] (PS1) Mvolz: Revert^2 "Update Zotero to node 18" [deployment-charts] - https://gerrit.wikimedia.org/r/1075517 [09:49:43] (PS2) Mvolz: Revert^2 "Update Zotero to node 18" [deployment-charts] - https://gerrit.wikimedia.org/r/1075517 (https://phabricator.wikimedia.org/T361728) [09:49:56] RESOLVED: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:50:26] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:52:13] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:54:19] (PS3) Brouberol: Specify a custom deploy clusterrole for airflow namespaces in dse [deployment-charts] - https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389) [09:54:56] RESOLVED: 
[3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:57:25] (CR) CI reject: [V:-1] Specify a custom deploy clusterrole for airflow namespaces in dse [deployment-charts] - https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389) (owner: Brouberol) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240925T1000) [10:00:05] mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240925T1000). [10:01:48] akosiaris: o/ [10:04:08] (CR) Hnowlan: [C:+1] "I'm also a little short on the history but it seems like this is very safe to do." [puppet] - https://gerrit.wikimedia.org/r/1075152 (https://phabricator.wikimedia.org/T357750) (owner: Muehlenhoff) [10:04:15] (PS4) Brouberol: Specify a custom deploy clusterrole for airflow namespaces in dse [deployment-charts] - https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389) [10:06:32] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1025.eqiad.wmnet [10:07:45] (PS1) Muehlenhoff: Switch cloudcephosd1025 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1075519 (https://phabricator.wikimedia.org/T349619) [10:08:09] akosiaris: you around? 
[10:09:04] (CR) Tiziano Fogli: [C:+2] ripeatlas: add ping to wmf anchors check [alerts] - https://gerrit.wikimedia.org/r/1068732 (https://phabricator.wikimedia.org/T370506) (owner: Tiziano Fogli) [10:10:45] (Merged) jenkins-bot: ripeatlas: add ping to wmf anchors check [alerts] - https://gerrit.wikimedia.org/r/1068732 (https://phabricator.wikimedia.org/T370506) (owner: Tiziano Fogli) [10:11:46] mvolz: yes [10:12:04] sorry, I was drafting an email. [10:12:09] npn [10:12:12] np* [10:13:55] FIRING: SystemdUnitFailed: upload_puppet_facts.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:14:04] so I'm thinking we +2 https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1075517, we deploy staging, then we either depool codfw or eqiad and then deploy there? [10:14:15] eqiad is depooled right now anyway [10:14:16] I assume the least active one should be depooled? [10:14:19] oh it is? [10:14:24] this is the switchover week [10:14:33] it was done yesterday ~at 16:00UTC [10:14:35] oooohhhh [10:14:57] so actually this is a good week for it then. [10:15:00] you should be seeing 0 traffic in graphs btw (health checks aside) [10:15:15] (CR) Alexandros Kosiaris: [C:+1] Revert^2 "Update Zotero to node 18" [deployment-charts] - https://gerrit.wikimedia.org/r/1075517 (https://phabricator.wikimedia.org/T361728) (owner: Mvolz) [10:15:46] will we see alerts though? [10:15:55] yes [10:16:16] i can see those still look up for eqiad. okay... should we disable them or something to avoid pinging on call? [10:16:39] (I'll do staging now) [10:16:42] I can do that, gimme a sec [10:16:49] ok [10:17:01] both zotero AND citoid, right? [10:17:06] only zotero [10:17:10] ok [10:17:11] citoid we're not touching [10:17:23] although do the alerts give useful info? [10:17:27] probably not. 
[10:17:31] it might alert because of zotero which is a dependency? [10:17:45] in any case, I 'll disable the paging ones, we 'll see the non paging ones [10:17:57] does eqiad citoid contact eqiad zotero? [10:18:07] wouldn't it just use whichever one is pooled? or not? [10:18:14] ah, right now, no it doesn't. you are right [10:19:03] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/zotero: apply [10:19:23] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/zotero: apply [10:21:41] (CR) Muehlenhoff: [C:+2] Switch cloudcephosd1025 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1075519 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [10:22:08] I see the pod is up. no logs ofc for zotero, the standard envoy startup logs for the tls-proxy sidecar container [10:22:19] what's the test curl call again? [10:22:20] (PS1) Gmodena: mw-page-content-change-enrich: enable claico network policies. [deployment-charts] - https://gerrit.wikimedia.org/r/1075520 (https://phabricator.wikimedia.org/T373195) [10:22:33] https://wikitech.wikimedia.org/wiki/Zotero/Deploying_zotero#Staging_server [10:22:38] I tried both of the samples there [10:22:42] they responded just fine. [10:23:42] well, off to eqiad then ? 
[10:23:55] sure, will do [10:24:10] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/zotero: apply [10:24:35] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [10:25:05] oh no it's working [10:25:13] lol [10:25:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1025.eqiad.wmnet [10:25:57] curl -k -d 'https://en.wikipedia.org/wiki/Darth_Vader' -H 'Content-Type: text/plain' https://zotero.svc.eqiad.wmnet:4969/web I think is even the probe query that goes down [10:25:59] (CR) Btullis: Specify a custom deploy clusterrole for airflow namespaces in dse (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389) (owner: Brouberol) [10:28:24] (CR) Gmodena: mw-page-content-change-enrich: enable claico network policies. (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1075520 (https://phabricator.wikimedia.org/T373195) (owner: Gmodena) [10:28:47] mvolz: can't say I can reproduce. It's consistently returning ok? [10:29:34] takes a while but on every call I see [{"key":"TYBXVFQ6","version":0 yada yada yada [10:29:42] Yeah works fine for me too [10:31:31] James_F recently posted something about how there are errors for the tls proxies or something? [10:32:01] alerts about them being at times close to memory limits [10:32:20] (PS1) Abijeet Patro: Revert^2 "Enable message group subscription feature for Test Wikipedia" [mediawiki-config] - https://gerrit.wikimedia.org/r/1075522 [10:32:52] which we can bump ofc, but there should be something more (like a log line) that points out that's the issue [10:33:06] wanna go for codfw? 
[10:33:22] at least at that point we 'll know [10:33:39] sure but I will bet you it alerts :P [10:33:59] I can also repool eqiad for a while [10:34:14] if we think it's traffic related, that should point to something [10:34:46] so it's going to Alert citoid, because citoid is the proxy for zotero [10:34:56] basically swagger checks whether the response came from zotero [10:35:13] (PS2) Abijeet Patro: Translate: Add VirtualDomainsMapping [mediawiki-config] - https://gerrit.wikimedia.org/r/1075521 (https://phabricator.wikimedia.org/T372287) [10:35:21] since citoid (ideally) has a native scraping option, then it'll 200 but it won't be from zotero, and then will alert. [10:36:06] (CR) Daniel Kinzler: [C:+1] "yes, please" [mediawiki-config] - https://gerrit.wikimedia.org/r/1075335 (https://phabricator.wikimedia.org/T375512) (owner: BPirkle) [10:36:39] mvolz: let's see, it's easy enough to depool eqiad anyway [10:36:53] I 'll only pool zotero, not both [10:37:17] ok, let me know when you're done repooling [10:38:44] !log akosiaris@cumin1002 START - Cookbook sre.discovery.service-route pool zotero in eqiad: maintenance [10:41:58] (CR) Ladsgroup: [C:+1] Translate: Add VirtualDomainsMapping [mediawiki-config] - https://gerrit.wikimedia.org/r/1075521 (https://phabricator.wikimedia.org/T372287) (owner: Abijeet Patro) [10:42:54] (CR) Sg912: [C:+1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/1075207 (https://phabricator.wikimedia.org/T373473) (owner: Santiago Faci) [10:43:10] I can see some 500s [10:43:26] akosiaris: where are you looking? 
[10:43:39] (CR) Santiago Faci: [C:+2] MPIC: Deploying on staging a new relase v0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1075207 (https://phabricator.wikimedia.org/T373473) (owner: Santiago Faci) [10:43:39] https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-kubernetes_namespace=zotero&var-kubernetes_namespace=citoid&var-app=All&var-destination=zotero&from=now-15m&to=now [10:43:45] nothing out of the ordinary though [10:43:48] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) pool zotero in eqiad: maintenance [10:44:35] it's done btw [10:44:38] (Merged) jenkins-bot: MPIC: Deploying on staging a new relase v0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1075207 (https://phabricator.wikimedia.org/T373473) (owner: Santiago Faci) [10:44:40] yeah it looks normal [10:44:47] 500s are kind of normal for Zotero [10:44:59] it reports a lot of things that are more like 4xx as 500 [10:45:51] well, deploy to codfw I 'd say? by now I don't know much more that we can do to gain more confidence [10:46:16] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [10:46:37] !log sfaci@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [10:47:07] I'm a little confused, in grafana the number of requests don't seem to show any decrease from the depooling? [10:47:26] which graph are you looking at? 
[10:47:37] the linked envoy telemetry one [10:48:08] oh nevermind that's because it's codfw [10:48:10] haha [10:48:42] ah wait, there is 1 more thing I can do [10:48:47] I 'll depool zotero in codfw [10:48:49] gimme a sec [10:48:58] in which case I don't see an increase in requests after re-pooling [10:49:06] (PS1) Samtar: IS-labs: Enable wgUseCodexSpecialBlock on test.beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1075523 (https://phabricator.wikimedia.org/T375610) [10:49:09] !log akosiaris@cumin1002 START - Cookbook sre.discovery.service-route depool zotero in codfw: maintenance [10:49:10] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache zotero.discovery.wmnet on all recursors [10:49:14] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) zotero.discovery.wmnet on all recursors [10:50:34] (PS1) Santiago Faci: MPIC: Deploying to production a new relase v0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1075524 (https://phabricator.wikimedia.org/T373473) [10:52:29] https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-kubernetes_namespace=zotero&var-app=All&var-destination=All&from=now-30m&to=now [10:52:34] (CR) Sg912: [C:+1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/1075524 (https://phabricator.wikimedia.org/T373473) (owner: Santiago Faci) [10:52:38] traffic to eqiad zotero is definitely increasing [10:52:54] it's still 0.5rps, but it is what it is [10:53:19] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, September 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#de" [mediawiki-config] - https://gerrit.wikimedia.org/r/1070562 (https://phabricator.wikimedia.org/T369069) (owner: Sergio Gimeno) [10:53:24] max apparently in the last 2 days has been ~4 [10:53:26] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host 
cloudcephosd1037.eqiad.wmnet [10:54:08] i don't see it in envoy? [10:54:13] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool zotero in codfw: maintenance [10:54:15] https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=codfw&var-prometheus=k8s&var-kubernetes_namespace=zotero&var-app=zotero&var-destination=local_service&from=now-30m&to=now [10:54:21] this ^ is codfw dropping [10:54:32] switch to eqiad on the dropdown and it should be increasing [10:54:43] (PS1) Muehlenhoff: Switch cloudcephosd1037 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1075526 (https://phabricator.wikimedia.org/T349619) [10:54:47] ok [10:54:59] this dashboard could be done a bit better, but I 'll file that for later [10:55:26] I had it set to zotero and not local service [10:55:30] not sure what the difference is :P [10:55:33] (CR) Muehlenhoff: [C:+2] Switch cloudcephosd1037 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1075526 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [10:56:30] still looking okay to me, you? [10:56:33] (CR) Santiago Faci: [C:+2] MPIC: Deploying to production a new relase v0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1075524 (https://phabricator.wikimedia.org/T373473) (owner: Santiago Faci) [10:56:36] "zotero" is as citoid side sees it. 
local_service is as the local envoy sees it [10:56:46] (CR) Brouberol: Specify a custom deploy clusterrole for airflow namespaces in dse (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389) (owner: Brouberol) [10:56:53] (CR) Brouberol: Specify a custom deploy clusterrole for airflow namespaces in dse (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389) (owner: Brouberol) [10:56:59] but since all service have a "local_service" you need to take a bit of care to only pick the proper service [10:57:33] (Merged) jenkins-bot: MPIC: Deploying to production a new relase v0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/1075524 (https://phabricator.wikimedia.org/T373473) (owner: Santiago Faci) [10:57:40] so, yeah depooling of codfw and full repooling of eqiad has happened and I see 0 worrying things up to now [10:57:54] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 26 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1075521 (https://phabricator.wikimedia.org/T372287) (owner: Abijeet Patro) [10:57:56] FIRING: RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: ... 
[10:58:02] Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [10:58:17] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, September 26 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1075522 (owner: Abijeet Patro) [10:58:20] (PS5) Brouberol: Specify a custom deploy clusterrole for airflow namespaces in dse [deployment-charts] - https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389) [10:59:13] (PS6) Brouberol: Specify a custom deploy clusterrole for airflow namespaces in dse [deployment-charts] - https://gerrit.wikimedia.org/r/1075508 (https://phabricator.wikimedia.org/T364389) [10:59:21] mvolz: you promised me alerts and I see none :P [10:59:32] i'm sorry :( [10:59:49] I would ask what it is that you witnessed last time, but with zotero not emitting either metrics or logs that would be pointless [10:59:50] the only thing that changed since then is envoy has been updated. 
[11:00:08] the last time the swagger probe started failing [11:00:11] jouncebot: now [11:00:11] No deployments scheduled for the next 1 hour(s) and 59 minute(s) [11:00:16] jouncebot: next [11:00:17] In 1 hour(s) and 59 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240925T1300) [11:00:46] https://phabricator.wikimedia.org/T361728 [11:00:58] (Abandoned) Brouberol: modules: release a new version of app.job [deployment-charts] - https://gerrit.wikimedia.org/r/1021843 (https://phabricator.wikimedia.org/T362954) (owner: Brouberol) [11:01:16] yeah all we know is zotero just wasn't responding [11:01:19] or was giving errors [11:01:23] one of those two [11:01:33] and citoid was like nvm i'll do it myself [11:02:49] Actually jon said something about tls terminator? [11:02:56] RESOLVED: RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: ... [11:03:01] Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [11:03:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cloudcephosd1037.eqiad.wmnet [11:03:58] mvolz: so, zotero codfw right now is depooled. Wanna try upgrading there? 
[11:04:06] sure [11:04:21] let's run some tests there too and then I 'll pool it, then wait it out and then depool eqiad again [11:04:37] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cloudcephosd1036.eqiad.wmnet [11:04:50] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/zotero: apply [11:05:19] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/zotero: apply [11:06:19] (PS1) Muehlenhoff: Switch cloudcephosd1036 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1075527 (https://phabricator.wikimedia.org/T349619) [11:07:20] Looks okay again. [11:08:07] (PS1) Elukey: profile::trafficserver::backend: change timeouts for the docker registry [puppet] - https://gerrit.wikimedia.org/r/1075528 (https://phabricator.wikimedia.org/T242604) [11:08:21] ok, repooling codfw then [11:08:32] !log akosiaris@cumin1002 START - Cookbook sre.discovery.service-route pool zotero in codfw: maintenance [11:08:33] !log akosiaris@cumin1002 START - Cookbook sre.dns.wipe-cache zotero.discovery.wmnet on all recursors [11:08:37] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) zotero.discovery.wmnet on all recursors [11:08:53] which alerts were turned off? was it just the eqiad ones? [11:09:03] (PS1) Gmodena: mw-page-content-change-enrich: disable legary network policies. 
[deployment-charts] - https://gerrit.wikimedia.org/r/1075529 (https://phabricator.wikimedia.org/T373195) [11:09:25] (CR) Muehlenhoff: [C:+2] Switch cloudcephosd1036 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1075527 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [11:09:31] (CR) Elukey: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4118/co" [puppet] - https://gerrit.wikimedia.org/r/1075528 (https://phabricator.wikimedia.org/T242604) (owner: Elukey) [11:09:54] (PS2) Gmodena: mw-page-content-change-enrich: disable legacy network policies. [deployment-charts] - https://gerrit.wikimedia.org/r/1075529 (https://phabricator.wikimedia.org/T373195) [11:10:35] !log running UPDATE into viwiki db2218 (s7) T375507 [11:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:56] FIRING: RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: ... [11:10:56] Processing latency of WCQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [11:11:19] (CR) Elukey: [V:+1] "Hi folks! I have no idea if this is the preferred/best way forward on the ATS side, lemme know if you feel differently." [puppet] - https://gerrit.wikimedia.org/r/1075528 (https://phabricator.wikimedia.org/T242604) (owner: Elukey) [11:11:33] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host cloudcephosd1036.eqiad.wmnet [11:11:52] mvolz: none right now, they 've expired [11:11:59] ok [11:13:12] (CR) Gmodena: [C:+2] dse-k8s-services: fix values in dump enrichment app. 
[deployment-charts] - https://gerrit.wikimedia.org/r/1075226 (https://phabricator.wikimedia.org/T368787) (owner: Gmodena) [11:13:36] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) pool zotero in codfw: maintenance [11:13:52] this is ridiculous https://gerrit.wikimedia.org/r/c/mediawiki/services/zotero/+/1075530 [11:14:21] (Merged) jenkins-bot: dse-k8s-services: fix values in dump enrichment app. [deployment-charts] - https://gerrit.wikimedia.org/r/1075226 (https://phabricator.wikimedia.org/T368787) (owner: Gmodena) [11:15:07]