[00:08:48] rzl: I'm in, ty!
[00:15:54] (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[00:21:41] RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:26:43] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:28:31] SRE, Gerrit, Traffic, Patch-For-Review, Release-Engineering-Team (Development services): Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (Quiddity) >>! In T191183#8250010, @kostajh wrote: >>>! In T191183#8249977, @hashar wrote: >> Maybe if one day MediaWiki supports attac...
[01:10:09] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:11:19] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:22:11] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:36:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:06:45] (JobUnavailable) resolved: (7) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:53:47] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:55:03] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:54:15] (PS1) KartikMistry: Enable Content and Section Translation in Bhojpuri Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/833885 (https://phabricator.wikimedia.org/T313296)
[04:05:37] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:07:49] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48681 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:15:54] (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[04:43:23] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:09:21] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:11:33] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48682 bytes in 0.115 second response time
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:00:05] kormat, marostegui, and Amir1: OwO what's this, a deployment window?? Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220922T0600). nyaa~
[07:00:04] Amir1, apergos, and jnuche: May I have your attention please! UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220922T0700)
[07:00:04] kart_: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:11] morning! no trainees are signed up for the window, and we have one patch scheduled for deployment. kart_ I presume you will be self-deploying today?
[07:00:49] apergos: yes :)
[07:00:55] sounds great!
[07:01:00] I'll go ahead then..
[07:01:06] please do!
[07:01:39] (CR) KartikMistry: [C: +2] Enable Content and Section Translation in Bhojpuri Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/833885 (https://phabricator.wikimedia.org/T313296) (owner: KartikMistry)
[07:02:24] (Merged) jenkins-bot: Enable Content and Section Translation in Bhojpuri Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/833885 (https://phabricator.wikimedia.org/T313296) (owner: KartikMistry)
[07:05:46] Deploying..
[07:06:16] sweet!
[07:08:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:09:41] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:833885|Enable Content and Section Translation in Bhojpuri Wikipedia (T313296)]] (duration: 04m 03s)
[07:09:44] T313296: Enable Content and Section translation on wikipedias with new MT support from Google - https://phabricator.wikimedia.org/T313296
[07:10:55] OK. I'm done :)
[07:11:15] looks fine to me!
[07:13:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:13:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:14:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:31:33] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:35:29] guess I ought to close out the backport window officially since nothing else is happening
[07:35:55] !log UTC morning backport and config training deployment window closed a bit belatedly
[07:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:08] (PS1) DCausse: rdf-streaming-updater: use rdf-streaming-updater-codfw swift container [deployment-charts] - https://gerrit.wikimedia.org/r/833991 (https://phabricator.wikimedia.org/T316028)
[07:49:11] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37312/console" [puppet] - https://gerrit.wikimedia.org/r/833842 (https://phabricator.wikimedia.org/T312882) (owner: Aqu)
[07:51:06] SRE, Data Pipelines, Infrastructure-Foundations, puppet-compiler: Systematic PCC error - https://phabricator.wikimedia.org/T318281 (jbond) Open→Resolved This is all [[ https://puppet-compiler.wmflabs.org/pcc-worker1003/37312/an-launcher1002.eqiad.wmnet/fulldiff.html | working again ]] now...
[07:55:02] (CR) Aqu: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37313/console" [puppet] - https://gerrit.wikimedia.org/r/833842 (https://phabricator.wikimedia.org/T312882) (owner: Aqu)
[07:55:29] (PS5) Jbond: Deploy Spark 3 conf and debian pkg to test cluster [puppet] - https://gerrit.wikimedia.org/r/833406 (https://phabricator.wikimedia.org/T312882) (owner: Aqu)
[07:56:27] (CR) CI reject: [V: -1] Deploy Spark 3 conf and debian pkg to test cluster [puppet] - https://gerrit.wikimedia.org/r/833406 (https://phabricator.wikimedia.org/T312882) (owner: Aqu)
[07:56:49] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37314/console" [puppet] - https://gerrit.wikimedia.org/r/833406 (https://phabricator.wikimedia.org/T312882) (owner: Aqu)
[08:00:05] jnuche and dancy: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220922T0800).
[08:15:54] (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[08:36:06] (CR) Jbond: update-known-hosts-production: Capture all fingerprints (2 comments) [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/832698 (https://phabricator.wikimedia.org/T318006) (owner: Bking)
[08:43:49] (CR) Jbond: [C: -1] update-known-hosts-production: Capture all fingerprints (1 comment) [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/832698 (https://phabricator.wikimedia.org/T318006) (owner: Bking)
[08:56:54] (PS1) Volans: roll-restart-reboot-docker-registry: fix docstring [cookbooks] - https://gerrit.wikimedia.org/r/833996
[09:01:58] (PS1) Jbond: P:ssh::publish_fingerprints: use epp templates [puppet] - https://gerrit.wikimedia.org/r/833997 (https://phabricator.wikimedia.org/T318006)
[09:02:57] (PS2) Jbond: C:ssh::publish_fingerprints: use epp templates [puppet] - https://gerrit.wikimedia.org/r/833997 (https://phabricator.wikimedia.org/T318006)
[09:05:09] (PS3) Jbond: C:ssh::publish_fingerprints: use epp templates [puppet] - https://gerrit.wikimedia.org/r/833997 (https://phabricator.wikimedia.org/T318006)
[09:12:41] (PS4) Jbond: C:ssh::publish_fingerprints: use epp templates [puppet] - https://gerrit.wikimedia.org/r/833997 (https://phabricator.wikimedia.org/T318006)
[09:14:07] PROBLEM - Check systemd state on cloudbackup2002 is CRITICAL: CRITICAL - degraded: The following units failed: block_sync-misc-project.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:15:35] SRE, Discovery-Search, Elasticsearch: Port elasticsearch support scripts to cookbooks -
https://phabricator.wikimedia.org/T269218 (Gehel) Open→Declined Those scripts have not been used in forever, porting them makes little sense if they are not useful.
[09:16:49] (PS5) Jbond: C:ssh::publish_fingerprints: use epp templates [puppet] - https://gerrit.wikimedia.org/r/833997 (https://phabricator.wikimedia.org/T318006)
[09:18:59] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:23:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:39:59] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:44:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:53:25] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:54:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET namespaces) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve -
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:55:13] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:59:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET namespaces) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:00:05] mvolz: gettimeofday() says it's time for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220922T1000)
[10:17:38] (PS1) DCausse: rdf-streaming-updater: alert on thanos-swift space usage [alerts] - https://gerrit.wikimedia.org/r/834008 (https://phabricator.wikimedia.org/T316005)
[10:25:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET namespaces) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:30:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET namespaces) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:34:25] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:34:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET deployments) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging -
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:39:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET deployments) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:48:23] (PS3) Anzx: Add wgMetaNamespace for knwiktionary and knwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/833816
[10:55:17] (PS6) Jbond: C:ssh::publish_fingerprints: use epp templates [puppet] - https://gerrit.wikimedia.org/r/833997 (https://phabricator.wikimedia.org/T318006)
[10:56:48] (PS7) Jbond: C:ssh::publish_fingerprints: use epp templates [puppet] - https://gerrit.wikimedia.org/r/833997 (https://phabricator.wikimedia.org/T318006)
[10:58:18] (PS8) Jbond: C:ssh::publish_fingerprints: use epp templates [puppet] - https://gerrit.wikimedia.org/r/833997 (https://phabricator.wikimedia.org/T318006)
[11:01:08] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37322/console" [puppet] - https://gerrit.wikimedia.org/r/833997 (https://phabricator.wikimedia.org/T318006) (owner: Jbond)
[11:13:17] (PS1) Jbond: C:ssh::publish_fingerprints: add combined file [puppet] - https://gerrit.wikimedia.org/r/834015 (https://phabricator.wikimedia.org/T318006)
[11:15:26] (PS1) Jbond: C:ssh::publish_fingerprints: drop RSA support [puppet] - https://gerrit.wikimedia.org/r/834017 (https://phabricator.wikimedia.org/T318006)
[11:15:32] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37323/console" [puppet] - https://gerrit.wikimedia.org/r/834015 (https://phabricator.wikimedia.org/T318006) (owner: Jbond)
[11:18:54] (CR) CI reject: [V: -1] C:ssh::publish_fingerprints: drop RSA support [puppet] -
https://gerrit.wikimedia.org/r/834017 (https://phabricator.wikimedia.org/T318006) (owner: Jbond)
[11:21:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:22:27] (PS1) DDesouza: Remove Research Incentive survey on idwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/834027 (https://phabricator.wikimedia.org/T316466)
[11:26:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:58:30] (PS2) Jbond: C:ssh::publish_fingerprints: drop RSA support [puppet] - https://gerrit.wikimedia.org/r/834017 (https://phabricator.wikimedia.org/T318006)
[11:59:23] (CR) Jbond: [V: +1 C: +2] C:ssh::publish_fingerprints: use epp templates [puppet] - https://gerrit.wikimedia.org/r/833997 (https://phabricator.wikimedia.org/T318006) (owner: Jbond)
[11:59:57] (CR) Jbond: [V: +1 C: +2] C:ssh::publish_fingerprints: add combined file [puppet] - https://gerrit.wikimedia.org/r/834015 (https://phabricator.wikimedia.org/T318006) (owner: Jbond)
[12:00:07] (PS2) Jbond: C:ssh::publish_fingerprints: add combined file [puppet] - https://gerrit.wikimedia.org/r/834015 (https://phabricator.wikimedia.org/T318006)
[12:00:23] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37325/console" [puppet] - https://gerrit.wikimedia.org/r/834017 (https://phabricator.wikimedia.org/T318006) (owner: Jbond)
[12:02:31] PROBLEM - SSH on
mw1311.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:04:11] (CR) DCausse: [C: +2] rdf-streaming-updater: use rdf-streaming-updater-codfw swift container [deployment-charts] - https://gerrit.wikimedia.org/r/833991 (https://phabricator.wikimedia.org/T316028) (owner: DCausse)
[12:09:52] (Merged) jenkins-bot: rdf-streaming-updater: use rdf-streaming-updater-codfw swift container [deployment-charts] - https://gerrit.wikimedia.org/r/833991 (https://phabricator.wikimedia.org/T316028) (owner: DCausse)
[12:15:54] (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[12:21:59] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply
[12:22:37] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply
[12:22:40] !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply
[12:23:19] (PS1) Jbond: update-known-hosts-production: Capture all fingerprints [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/834038 (https://phabricator.wikimedia.org/T318006)
[12:23:22] (PS1) Jbond: 0.5.4: Prpare release [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/834039
[12:24:33] (CR) Jbond: [C: -1] "see https://gerrit.wikimedia.org/r/c/operations/debs/wmf-sre-laptop/+/834038" [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/832698 (https://phabricator.wikimedia.org/T318006) (owner: Bking)
[12:24:40] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply
[12:24:48]
!log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply
[12:25:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET services) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:30:45] (PS3) Jbond: C:ssh::publish_fingerprints: drop RSA support [puppet] - https://gerrit.wikimedia.org/r/834017 (https://phabricator.wikimedia.org/T318006)
[12:30:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET services) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:32:02] (CR) Jbond: [C: -1] "self -1: this should wait untill we have rebuild the wmf-sre-laptop package and have confidence that it has been upgraded in most places" [puppet] - https://gerrit.wikimedia.org/r/834017 (https://phabricator.wikimedia.org/T318006) (owner: Jbond)
[12:32:43] (CR) Jbond: C:ssh::publish_fingerprints: drop RSA support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/834017 (https://phabricator.wikimedia.org/T318006) (owner: Jbond)
[12:34:44] (CR) Jbond: [C: +1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/833996 (owner: Volans)
[12:37:50] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply
[12:38:11] !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply
[12:39:10] (PS1) DDesouza: Deploy Research Incentive survey on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/834042 (https://phabricator.wikimedia.org/T318328)
[12:41:27] (PS1) DDesouza: Deploy Research Incentive survey on eswiki [mediawiki-config] -
https://gerrit.wikimedia.org/r/834044 (https://phabricator.wikimedia.org/T318331)
[12:42:26] (PS1) Vgutierrez: Release 9.1.3-1wm2 [debs/trafficserver] - https://gerrit.wikimedia.org/r/834045 (https://phabricator.wikimedia.org/T317660)
[12:42:34] (CR) CI reject: [V: -1] Deploy Research Incentive survey on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/834044 (https://phabricator.wikimedia.org/T318331) (owner: DDesouza)
[12:57:14] (PS1) DDesouza: Deploy Research Incentive survey on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/834048 (https://phabricator.wikimedia.org/T318333)
[12:59:49] (CR) Volans: [C: +2] roll-restart-reboot-docker-registry: fix docstring [cookbooks] - https://gerrit.wikimedia.org/r/833996 (owner: Volans)
[13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220922T1300)
[13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220922T1300).
[13:00:05] anoop, zabe, and danisztls: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:20] I can deploy today
[13:00:20] o/
[13:00:22] o/
[13:00:37] o/
[13:01:15] (CR) Urbanecm: [C: +2] Add wgMetaNamespace for knwiktionary and knwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/833816 (owner: Anzx)
[13:02:01] (Merged) jenkins-bot: Add wgMetaNamespace for knwiktionary and knwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/833816 (owner: Anzx)
[13:03:02] zabe: ad your patch, are you sure 0 is depooled?
https://noc.wikimedia.org/db.php indicates it means "primary"
[13:03:09] (Merged) jenkins-bot: roll-restart-reboot-docker-registry: fix docstring [cookbooks] - https://gerrit.wikimedia.org/r/833996 (owner: Volans)
[13:03:43] RECOVERY - SSH on mw1311.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:03:45] anoop: pulled to mwdebug1001, can you test please?
[13:03:54] ok
[13:03:54] urbanecm: the first element is the primary, not weight 0
[13:03:59] in production, those are the same thing
[13:03:59] (PS2) Zabe: beta: Add deployment-db10 [mediawiki-config] - https://gerrit.wikimedia.org/r/833865 (https://phabricator.wikimedia.org/T318126)
[13:04:06] usually
[13:04:20] (PS3) Zabe: beta: Add deployment-db10 [mediawiki-config] - https://gerrit.wikimedia.org/r/833865 (https://phabricator.wikimedia.org/T318126)
[13:04:23] taavi: aha! thanks for explaining. so, is zabe's patch correct?
[13:04:29] (if you have a while to +1 it, I'd appreciate it)
[13:05:01] SRE, Performance-Team, serviceops: Evaluate using igbinary for MW php-apcu at WMF - https://phabricator.wikimedia.org/T225074 (Krinkle)
[13:05:07] @urbanecm. working fine
[13:05:14] anoop: thanks, syncing
[13:05:40] zabe: may I ask what's the point of adding a new replica there as weight 0 since in beta you need a separate config change to pool it anyways?
[13:06:23] (PS2) Urbanecm: Remove Research Incentive survey on idwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/834027 (https://phabricator.wikimedia.org/T316466) (owner: DDesouza)
[13:06:27] (CR) Urbanecm: [C: +2] Remove Research Incentive survey on idwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/834027 (https://phabricator.wikimedia.org/T316466) (owner: DDesouza)
[13:07:20] (Merged) jenkins-bot: Remove Research Incentive survey on idwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/834027 (https://phabricator.wikimedia.org/T316466) (owner: DDesouza)
[13:07:28] taavi, I can check whether it works through mwmaint (since I can specify the targeted host there) and if it does I can pool it
[13:08:26] zabe: `taavi@deployment-mwmaint02:~$ sql enwiki -- --host=deployment-db10` works without that
[13:08:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:09:30] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: ff867a48d617bc556be23ac595c4e3c5466f69c1: Add wgMetaNamespace for knwiktionary and knwikiquote (T318318) (duration: 03m 57s)
[13:09:33] T318318: change wgMetaNamespace for knwiktionary and knwikiquote - https://phabricator.wikimedia.org/T318318
[13:09:33] interesting
[13:09:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:09:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:09:49] I can merge the two patches
[13:09:49] danisztls: your patch is at mwdebug1001, can you test?
[13:09:56] urbanecm: yes
[13:10:04] let me know how it goes :)
[13:10:15] zabe: wdym?
[13:10:39] urbanecm: looks good
[13:10:44] thanks, syncing
[13:10:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:11:19] I had two patches with the intention of testing between them, but since that is not necessary you can either merge them together or I merge them into one patch
[13:11:29] SRE, ops-eqiad, DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (Jclark-ctr) Dell had not responded for ticket. Reopened a new ticket Confirmed: Service Request 152257151 was successfully submitted.
[13:12:30] (Abandoned) Zabe: beta: Pool deployment-db10 [mediawiki-config] - https://gerrit.wikimedia.org/r/833866 (https://phabricator.wikimedia.org/T318126) (owner: Zabe)
[13:12:33] (PS4) Zabe: beta: Add deployment-db10 [mediawiki-config] - https://gerrit.wikimedia.org/r/833865 (https://phabricator.wikimedia.org/T318126)
[13:12:35] zabe: oh, different kind of merge :).
[13:12:40] let me know once it's ready
[13:12:50] should be ready
[13:13:07] SRE, ops-eqiad, DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (Marostegui) Thanks!!
[13:13:12] taavi, if you wanna take another look
[13:13:15] (PS5) Urbanecm: beta: Add deployment-db10 [mediawiki-config] - https://gerrit.wikimedia.org/r/833865 (https://phabricator.wikimedia.org/T318126) (owner: Zabe)
[13:13:19] (CR) Urbanecm: [C: +2] beta: Add deployment-db10 [mediawiki-config] - https://gerrit.wikimedia.org/r/833865 (https://phabricator.wikimedia.org/T318126) (owner: Zabe)
[13:13:27] (removed -2)
[13:14:34] (PS1) DDesouza: Deploy Research Incentive on enwiki (beta) [mediawiki-config] - https://gerrit.wikimedia.org/r/834050 (https://phabricator.wikimedia.org/T318333)
[13:14:51] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: dcf37106d32ddda58948dbd6bc7ef3eb823a8e3d: Remove Research Incentive survey on idwiki (T316466) (duration: 03m 50s)
[13:14:54] T316466: Deploy Research Incentive Survey on Indonesian Wikipedia - https://phabricator.wikimedia.org/T316466
[13:14:55] danisztls: should be live
[13:15:20] urbanecm: thanks
[13:15:22] (PS2) DDesouza: Deploy Research Incentive survey on enwiki (beta) [mediawiki-config] - https://gerrit.wikimedia.org/r/834050 (https://phabricator.wikimedia.org/T318333)
[13:15:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:16:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:16:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:16:55] zabe: +2'ing, let's test and see, it looks good to me
[13:16:58] (CR) Urbanecm: [C: +2] beta: Add deployment-db10 [mediawiki-config] - https://gerrit.wikimedia.org/r/833865 (https://phabricator.wikimedia.org/T318126) (owner: Zabe)
[13:17:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:18:51] kicked beta-code-update-eqiad
[13:19:06] it's not merged yet
[13:19:13] oh, it is
[13:19:15] just wikibugs didn't say
[13:19:23] anyway, should be
live soon :)
[13:21:26] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[13:21:53] is live
[13:22:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:22:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:23:24] `show full processlist;` shows db traffic and beta logstash looks clear
[13:23:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:23:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:24:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:27:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:55:12] SRE, SRE-Access-Requests, Data-Engineering-Operations: Requesting Kerberos access for alinebruenger and siko - https://phabricator.wikimedia.org/T316766 (Aline_Bruenger_WMDE) Open→Resolved a:Aline_Bruenger_WMDE Thanks a lot!
[13:56:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:57:24] ^ this me [13:57:29] *is [13:57:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db119[67] - https://phabricator.wikimedia.org/T313978 (10Jclark-ctr) @jcrespo @Marostegui Those host names have been used I have entered into netbox db1204 , db1205. Please confirm those names will work [14:01:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:06:11] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:16:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:17:04] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:17:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes 
- https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:20:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db119[67] - https://phabricator.wikimedia.org/T313978 (10Jclark-ctr) db1204 E3 U24 port 20 Cableid 20220227 db1205 F3 U24 port 20 Cableid 20220228 [14:20:37] PROBLEM - SSH on analytics1077.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:20:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db119[67] - https://phabricator.wikimedia.org/T313978 (10Jclark-ctr) [14:21:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:22:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:29:59] 10SRE, 10Data Pipelines, 10Infrastructure-Foundations, 10puppet-compiler: Systematic PCC error - https://phabricator.wikimedia.org/T318281 (10Antoine_Quhen) Working. Thanks! 
[14:43:05] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:43:35] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:06:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:11:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:18:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:20:40] random question: is the PHP 7.4 migration blocked on anything, or is it just waiting for the right people to get back from summits/offsites/vacation? [15:23:28] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:28:38] Lucas_WMDE: AFAIK it's the latter, yeah [15:28:40] Is there some way to get prometheus metrics such as varnish_requests{status=~"5.."} but limiting to specific backend services? [15:29:11] Reedy: good to know, thanks! 
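A rough sketch of what the question above is asking for: a per-backend 5xx rate, similar in spirit to `sum by (backend) (rate(varnish_requests{status=~"5.."}[5m]))`. This is only an illustration under stated assumptions. The `backend` label, the snapshot layout, and the sample numbers are hypothetical and not taken from any real exporter, and the reset handling is a simplification of what Prometheus does.

```python
import re
from collections import defaultdict

# Same pattern as status=~"5.." (PromQL regex matchers are fully anchored).
FIVE_XX = re.compile(r"5..")


def rate_5xx_by_backend(earlier, later, interval_seconds):
    """Per-second 5xx rate per backend between two counter snapshots.

    Each snapshot maps (backend, status) -> cumulative request count.
    The "backend" key is a hypothetical label used for illustration.
    """
    rates = defaultdict(float)
    for (backend, status), count in later.items():
        if not FIVE_XX.fullmatch(status):
            continue
        increase = count - earlier.get((backend, status), 0)
        # A counter that went down suggests the exporter restarted;
        # as a simplification, count from zero in that case.
        if increase < 0:
            increase = count
        rates[backend] += increase / interval_seconds
    return dict(rates)


# Made-up snapshots taken 300 seconds apart.
earlier = {("maps", "200"): 1000, ("maps", "503"): 10, ("sessionstore", "500"): 4}
later = {("maps", "200"): 1600, ("maps", "503"): 40, ("sessionstore", "500"): 4}
print(rate_5xx_by_backend(earlier, later, 300))
```

The breakdown only works if the exporter attaches a per-backend label to the counter in the first place; without one there is nothing to group by.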
[15:30:53] Lucas_WMDE: From what I can see, there might be a few bugs that have been noticed, but nothing substantial enough to be a blocker [15:31:07] awight: not on varnish metrics at least, as varnish always connects to ats-be [15:32:07] taavi: Interesting, I thought it might be logged because maps has an idea of which backend it's connecting to, e.g. the maps.wikimedia.org service [15:32:35] Is there any other place I can look to find the rate of 5xx errors on a service? [15:35:32] awight: the short answer is, it depends :) different services export different metrics to prometheus -- some software breaks the metrics down by backend, and some doesn't [15:35:47] for anything that goes through Envoy, we do have pretty good metrics, and you can break them down by origin and destination: https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1 [15:37:02] so if you wanted to see for example request rates, errors, and latency for all the requests from appservers to sessionstore, that's a good dashboard for it [15:37:57] but otherwise, service by service, it just depends on what metrics are exported and then which ones we have dashboarded [15:38:32] 10SRE-OnFire, 10Observability-Alerting, 10Discovery-Search (Current work), 10Sustainability (Incident Followup): Improve Search team alerting for missing masters - https://phabricator.wikimedia.org/T313095 (10bking) [15:38:52] (03PS1) 10Majavah: P:terraform: add a new basic terraform module registry [puppet] - 10https://gerrit.wikimedia.org/r/834344 (https://phabricator.wikimedia.org/T317480) [15:39:43] (03CR) 10CI reject: [V: 04-1] P:terraform: add a new basic terraform module registry [puppet] - 10https://gerrit.wikimedia.org/r/834344 (https://phabricator.wikimedia.org/T317480) (owner: 10Majavah) [15:40:38] (03PS2) 10Majavah: P:terraform: add a new basic terraform module registry [puppet] - 10https://gerrit.wikimedia.org/r/834344 (https://phabricator.wikimedia.org/T317480) [15:40:39] rzl: Thanks, I was worried this might be 
the case but it's helpful to know. My issue is that the service isn't logging its errors consistently, so I want a more objective measure of what users experience. [15:41:28] (03CR) 10CI reject: [V: 04-1] P:terraform: add a new basic terraform module registry [puppet] - 10https://gerrit.wikimedia.org/r/834344 (https://phabricator.wikimedia.org/T317480) (owner: 10Majavah) [15:42:03] (03PS3) 10Majavah: P:terraform: add a new basic terraform module registry [puppet] - 10https://gerrit.wikimedia.org/r/834344 (https://phabricator.wikimedia.org/T317480) [15:42:20] I think I'll suggest that maps be integrated with envoy [15:42:33] ah, got it -- prometheus metrics are still pretty much just whatever the exporting server says they are, too, so you'd have to get metrics from the requesting side if you're worried about missing data [15:42:52] (03CR) 10CI reject: [V: 04-1] P:terraform: add a new basic terraform module registry [puppet] - 10https://gerrit.wikimedia.org/r/834344 (https://phabricator.wikimedia.org/T317480) (owner: 10Majavah) [15:43:33] who's connecting to whom in this case? [15:43:54] (03PS4) 10Majavah: P:terraform: add a new basic terraform module registry [puppet] - 10https://gerrit.wikimedia.org/r/834344 (https://phabricator.wikimedia.org/T317480) [15:44:44] (03CR) 10CI reject: [V: 04-1] P:terraform: add a new basic terraform module registry [puppet] - 10https://gerrit.wikimedia.org/r/834344 (https://phabricator.wikimedia.org/T317480) (owner: 10Majavah) [15:45:27] (03PS5) 10Majavah: P:terraform: add a new basic terraform module registry [puppet] - 10https://gerrit.wikimedia.org/r/834344 (https://phabricator.wikimedia.org/T317480) [15:49:06] rzl: These are end-user browser requests to maps.wikimedia.org. 
For example, https://maps.wikimedia.org/img/osm-intl,6,53.383333,-1.466667,300x400.png?lang=en&domain=en.wikipedia.org&title=Downton+Abbey&revid=1107908463&groups=_ca5fe2a1449687fe54ae1fdbde7b637cd662d3b7 [15:49:50] got it, okay [15:52:15] so if you don't trust maps.wm.o to log errors correctly, you probably can't rely on it for accurate prometheus metrics, either -- inserting Envoy is definitely an option in order to get real-time grafana graphs, but you should also be able to get solid data from the edge's POV via the analytics cluster [16:00:05] jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220922T1600). [16:00:05] TheresNoTime and dancy: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:13] o/ [16:01:19] * TheresNoTime is here [16:02:12] hello! looking [16:04:52] TheresNoTime: your code looks good but I want to route this to the observability team instead of merging directly -- I'm not sure if we have a principled stance on adding beta monitoring to the prod alertmanager [16:05:20] I see profile::wikifunctions::beta is already there but I'm not sure if that's an exception or a new rule :) [16:05:33] rzl: okay! 
:) [16:05:48] unfortunately the team's all at a summit so I probably can't get you an answer today, but I'll see if I can find you one tomorrow [16:06:23] No worries, there's no urgency to that change at all [16:06:27] I'm adding a reviewer at least, and I'll ping folks in the morning to try and at least get you something prompt-ish, if not actually snappy [16:06:30] whew, okay [16:09:20] rzl: fwiw, without context, I oppose using prod alerting infrastructure to monitor deployment-prep [16:11:27] taavi: ack, you likely won't be alone in that position :) we'll arrive at some sort of solution even if it looks different from this one [16:11:56] dancy: haven't forgotten you, with you in just a sec [16:12:06] ok. [16:15:53] (03CR) 10Ssingh: [C: 03+1] Release 9.1.3-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/834045 (https://phabricator.wikimedia.org/T317660) (owner: 10Vgutierrez) [16:15:54] (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:17:10] (03CR) 10RLazarus: [C: 04-1] "Puppet request window here -- the code LGTM but adding Filippo: should we add more beta monitoring to the prod alertmanager, or suggest a " [puppet] - 10https://gerrit.wikimedia.org/r/832326 (https://phabricator.wikimedia.org/T315695) (owner: 10Samtar) [16:18:51] (03CR) 10RLazarus: [C: 03+2] scap.cfg.erb: Set initial value of beta_only_config_files [puppet] - 10https://gerrit.wikimedia.org/r/833455 (https://phabricator.wikimedia.org/T317242) (owner: 10Ahmon Dancy) [16:19:23] dancy: the merge is running now -- want a manual puppet run anywhere? [16:19:32] deploy1002 I guess? [16:19:36] Sure. 
[16:19:41] yeah, deploy1002 [16:21:06] dancy: done, three entire minutes ahead of schedule it turned out :) [16:21:07] test at will [16:21:50] I see the new /etc/scap.cfg on deploy1002. That's all I need. [16:21:57] 👍 [16:22:02] Thanks again [16:22:07] thank you! [16:23:11] RECOVERY - SSH on analytics1077.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:33:31] (03PS1) 10Ahmon Dancy: InitialiseSettings-labs.php: Added test text (to be reverted) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/834352 (https://phabricator.wikimedia.org/T317242) [16:36:42] joucebot nowandnext [16:36:47] jouncebot nowandnext [16:36:47] For the next 0 hour(s) and 23 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220922T1600) [16:36:47] In 0 hour(s) and 23 minute(s): Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220922T1700) [16:37:25] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dancy@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/834352 (https://phabricator.wikimedia.org/T317242) (owner: 10Ahmon Dancy) [16:38:17] (03Merged) 10jenkins-bot: InitialiseSettings-labs.php: Added test text (to be reverted) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/834352 (https://phabricator.wikimedia.org/T317242) (owner: 10Ahmon Dancy) [16:38:51] !log dancy@deploy1002 Started scap: Backport for [[gerrit:834352|InitialiseSettings-labs.php: Added test text (to be reverted) (T317242)]] [16:38:55] T317242: Make "scap backport" skip syncing steps for labs-only changes - https://phabricator.wikimedia.org/T317242 [16:39:16] !log dancy@deploy1002 dancy and dancy: Backport for [[gerrit:834352|InitialiseSettings-labs.php: Added test text (to be reverted) (T317242)]] synced to the testservers: mwdebug2001.codfw.wmnet, 
mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [16:39:20] !log dancy@deploy1002 Sync cancelled. [16:40:07] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dancy@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/834352 (https://phabricator.wikimedia.org/T317242) (owner: 10Ahmon Dancy) [16:40:30] (03PS1) 10Ahmon Dancy: Revert "InitialiseSettings-labs.php: Added test text (to be reverted)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833819 [16:42:32] (03CR) 10Bking: [C: 03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/834008 (https://phabricator.wikimedia.org/T316005) (owner: 10DCausse) [16:42:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:43:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:43:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:44:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:45:51] (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: alert on thanos-swift space usage [alerts] - 10https://gerrit.wikimedia.org/r/834008 (https://phabricator.wikimedia.org/T316005) (owner: 10DCausse) [16:46:59] (03CR) 10CI reject: [V: 04-1] rdf-streaming-updater: alert on thanos-swift space usage [alerts] - 10https://gerrit.wikimedia.org/r/834008 (https://phabricator.wikimedia.org/T316005) (owner: 10DCausse) [16:49:47] (03CR) 10BCornwall: "This is a change that seems reasonable on the surface. However, I lack the historical context surrounding update-known-hosts-production an" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/834038 (https://phabricator.wikimedia.org/T318006) (owner: 10Jbond) [16:54:03] 10SRE, 10ops-eqiad, 10DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (10Jclark-ctr) @Marostegui Any chance we can take this server down so i can shift memory around? 
to see if it follows the bad dimm to new location [16:56:43] (03PS8) 10Aqu: Deploy Spark 3 conf and debian pkg to test cluster [puppet] - 10https://gerrit.wikimedia.org/r/833406 (https://phabricator.wikimedia.org/T312882) [16:57:36] (03CR) 10Aqu: "Thanks for the review. Changes added." [puppet] - 10https://gerrit.wikimedia.org/r/833406 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [16:58:39] (03CR) 10DCausse: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/834008 (https://phabricator.wikimedia.org/T316005) (owner: 10DCausse) [16:58:50] (03CR) 10BCornwall: "This is a change that seems reasonable on the surface. However, I lack the historical context surrounding and don't feel so comfortable +1" [puppet] - 10https://gerrit.wikimedia.org/r/834017 (https://phabricator.wikimedia.org/T318006) (owner: 10Jbond) [16:59:54] (03CR) 10Ahmon Dancy: "Sorry about the CI troubles. A recently-introduced problem is in the process of being reverted." [alerts] - 10https://gerrit.wikimedia.org/r/834008 (https://phabricator.wikimedia.org/T316005) (owner: 10DCausse) [17:00:05] bd808: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220922T1700). 
[17:03:47] (03CR) 10Ahmon Dancy: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/834008 (https://phabricator.wikimedia.org/T316005) (owner: 10DCausse) [17:10:55] (03CR) 10Ottomata: Deploy Spark 3 conf and debian pkg to test cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/833406 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [17:14:55] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:15:49] PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:17:44] (03PS9) 10Aqu: Deploy Spark 3 conf and debian pkg to test cluster [puppet] - 10https://gerrit.wikimedia.org/r/833406 (https://phabricator.wikimedia.org/T312882) [17:18:04] (03CR) 10Aqu: Deploy Spark 3 conf and debian pkg to test cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/833406 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [17:21:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [17:32:40] (03CR) 10Ottomata: [C: 03+1] Deploy Spark 3 conf and debian pkg to test cluster [puppet] - 10https://gerrit.wikimedia.org/r/833406 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [17:38:27] (03CR) 10DCausse: [C: 03+2] "Trying again, thanks for looking into this Ahmon! 
:)" [alerts] - 10https://gerrit.wikimedia.org/r/834008 (https://phabricator.wikimedia.org/T316005) (owner: 10DCausse) [17:41:00] (03Merged) 10jenkins-bot: rdf-streaming-updater: alert on thanos-swift space usage [alerts] - 10https://gerrit.wikimedia.org/r/834008 (https://phabricator.wikimedia.org/T316005) (owner: 10DCausse) [17:43:05] (03CR) 10Ottomata: [C: 03+2] Deploy Spark 3 conf and debian pkg to test cluster [puppet] - 10https://gerrit.wikimedia.org/r/833406 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [17:44:20] (03PS2) 10Sbailey: Enable Linter write of namespace tag and template fields on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833454 (https://phabricator.wikimedia.org/T175177) [17:46:43] (03CR) 10Sbailey: "Arlo, thanks for pointing out test2wiki is less desirable than testwiki, fixed it for both config values." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833454 (https://phabricator.wikimedia.org/T175177) (owner: 10Sbailey) [17:51:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:55:03] PROBLEM - SSH on analytics1076.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:58:18] (03PS1) 10Aqu: Add missing Spark 3 on an-test-coord* [puppet] - 10https://gerrit.wikimedia.org/r/834359 (https://phabricator.wikimedia.org/T312882) [17:58:53] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add missing Spark 3 on an-test-coord* [puppet] - 10https://gerrit.wikimedia.org/r/834359 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [18:00:00] (03PS1) 10BCornwall: lvs: Convert ::lvs::configuration to a profile [puppet] - 10https://gerrit.wikimedia.org/r/834360 (https://phabricator.wikimedia.org/T264132) [18:00:04] jnuche and dancy: gettimeofday() says it's time for MediaWiki 
train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220922T1800) [18:00:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:01:42] (03CR) 10CI reject: [V: 04-1] lvs: Convert ::lvs::configuration to a profile [puppet] - 10https://gerrit.wikimedia.org/r/834360 (https://phabricator.wikimedia.org/T264132) (owner: 10BCornwall) [18:07:37] o/ [18:08:00] (03PS1) 10TrainBranchBot: group2 wikis to 1.40.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/834361 (https://phabricator.wikimedia.org/T314191) [18:08:01] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.40.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/834361 (https://phabricator.wikimedia.org/T314191) (owner: 10TrainBranchBot) [18:08:45] (03Merged) 10jenkins-bot: group2 wikis to 1.40.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/834361 (https://phabricator.wikimedia.org/T314191) (owner: 10TrainBranchBot) [18:15:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:16:13] RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:16:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:16:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:17:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:18:47] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 3 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (10George_Chernilevsky) This file also cann... 
[18:20:06] (03PS2) 10BCornwall: lvs: Convert ::lvs::configuration to a profile [puppet] - 10https://gerrit.wikimedia.org/r/834360 (https://phabricator.wikimedia.org/T264132) [18:21:42] (03PS1) 10Arlolra: Restrict figure to the size of the media [skins/MinervaNeue] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/834364 (https://phabricator.wikimedia.org/T305357) [18:22:36] !log dancy@deploy1002 Installing scap version "4.22.0" for 561 hosts [18:22:54] !log dancy@deploy1002 Installation of scap version "4.22.0" completed for 561 hosts [18:23:35] !log dancy@deploy1002 Locking from deployment [ALL REPOSITORIES]: testing (planned duration: 60m 00s) [18:23:38] !log dancy@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: testing (duration: 00m 02s) [18:28:58] 10SRE, 10DC-Ops, 10Traffic-Icebox, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10BCornwall) @Vgutierrez What actions, if any, are still required for this ticket? [18:29:38] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.40.0-wmf.2 refs T314191 [18:29:42] T314191: 1.40.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T314191 [18:29:48] Train is done. 
[18:33:55] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@265686e]: (no justification provided) [18:34:09] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@265686e]: (no justification provided) (duration: 00m 13s) [18:37:27] !log jhuneidi@deploy1002 Started scap: testing [18:38:08] !log dancy@deploy1002 Started scap: testing [18:38:11] !log jhuneidi@deploy1002 Started scap: testing [18:46:04] (03PS1) 10Ottomata: debconf::set - add $owner param, set owner in conda_analytics/init.pp [puppet] - 10https://gerrit.wikimedia.org/r/834365 (https://phabricator.wikimedia.org/T312882) [18:47:59] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37329/console" [puppet] - 10https://gerrit.wikimedia.org/r/834365 (https://phabricator.wikimedia.org/T312882) (owner: 10Ottomata) [18:48:27] (03PS1) 10Arlolra: Fix media alignment since disabling wgParserEnableLegacyMediaDOM [skins/MinervaNeue] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/834366 (https://phabricator.wikimedia.org/T318300) [18:50:35] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:50:39] (03CR) 10Aqu: [C: 03+1] debconf::set - add $owner param, set owner in conda_analytics/init.pp [puppet] - 10https://gerrit.wikimedia.org/r/834365 (https://phabricator.wikimedia.org/T312882) (owner: 10Ottomata) [18:50:48] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 4 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (10bking) [18:51:40] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 3 others: Clean up the rdf-streaming-updater-codfw container from thanos-swift. - https://phabricator.wikimedia.org/T316031 (10bking) 05Resolved→03Open Re-opening so I can document the above swiftly commands on Wikitech. 
[18:56:05] RECOVERY - SSH on analytics1076.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:00:29] (03PS2) 10Ottomata: debconf::set - add $owner param, set owner in conda_analytics/init.pp [puppet] - 10https://gerrit.wikimedia.org/r/834365 (https://phabricator.wikimedia.org/T312882) [19:00:36] (03PS1) 10Bking: admin: add my tmux dotfile [puppet] - 10https://gerrit.wikimedia.org/r/834368 [19:01:00] (03PS2) 10Bking: admin: add bking's tmux dotfile [puppet] - 10https://gerrit.wikimedia.org/r/834368 [19:03:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [19:04:49] (03CR) 10Ottomata: [C: 03+2] debconf::set - add $owner param, set owner in conda_analytics/init.pp [puppet] - 10https://gerrit.wikimedia.org/r/834365 (https://phabricator.wikimedia.org/T312882) (owner: 10Ottomata) [19:05:19] (03PS1) 10Ryan Kemper: ryankemper: add tmux, vim, zsh conf [puppet] - 10https://gerrit.wikimedia.org/r/834369 [19:07:55] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:08:35] (03PS3) 10Bking: admin: add my tmux dotfile [puppet] - 10https://gerrit.wikimedia.org/r/834368 [19:08:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [19:09:00] (03CR) 10Bking: [C: 03+1] ryankemper: add tmux, vim, zsh conf [puppet] - 
10https://gerrit.wikimedia.org/r/834369 (owner: 10Ryan Kemper) [19:09:25] (03PS2) 10Ssingh: Release 9.1.3-1wm2 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/834045 (https://phabricator.wikimedia.org/T317660) (owner: 10Vgutierrez) [19:09:47] (03CR) 10Jdlrobson: [C: 03+1] Restrict figure to the size of the media [skins/MinervaNeue] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/834364 (https://phabricator.wikimedia.org/T305357) (owner: 10Arlolra) [19:09:51] (03CR) 10Jdlrobson: [C: 03+1] Fix media alignment since disabling wgParserEnableLegacyMediaDOM [skins/MinervaNeue] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/834366 (https://phabricator.wikimedia.org/T318300) (owner: 10Arlolra) [19:10:47] (03CR) 10Ryan Kemper: [C: 03+1] "I might put bking in the commit subject" [puppet] - 10https://gerrit.wikimedia.org/r/834368 (owner: 10Bking) [19:11:20] (03PS4) 10Bking: admin: add bking's tmux dotfile [puppet] - 10https://gerrit.wikimedia.org/r/834368 [19:15:00] (03CR) 10Ryan Kemper: [C: 03+2] ryankemper: add tmux, vim, zsh conf [puppet] - 10https://gerrit.wikimedia.org/r/834369 (owner: 10Ryan Kemper) [19:17:00] (03CR) 10Bking: [C: 03+2] admin: add bking's tmux dotfile [puppet] - 10https://gerrit.wikimedia.org/r/834368 (owner: 10Bking) [19:20:37] PROBLEM - SSH on db1098.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:35:20] (03PS1) 10Ottomata: Set spark3 config on hadoop workers, test install only on one worker [puppet] - 10https://gerrit.wikimedia.org/r/834370 (https://phabricator.wikimedia.org/T312882) [19:36:29] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37330/console" [puppet] - 10https://gerrit.wikimedia.org/r/834370 (https://phabricator.wikimedia.org/T312882) (owner: 10Ottomata) [19:38:25] (03PS2) 10Ottomata: Set spark3 config on hadoop workers, test install only 
on one worker [puppet] - 10https://gerrit.wikimedia.org/r/834370 (https://phabricator.wikimedia.org/T312882) [19:39:52] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37332/console" [puppet] - 10https://gerrit.wikimedia.org/r/834370 (https://phabricator.wikimedia.org/T312882) (owner: 10Ottomata) [19:41:13] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Set spark3 config on hadoop workers, test install only on one worker [puppet] - 10https://gerrit.wikimedia.org/r/834370 (https://phabricator.wikimedia.org/T312882) (owner: 10Ottomata) [19:45:40] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [19:45:51] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:47:01] RECOVERY - Check systemd state on wcqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:47:03] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [19:47:53] (03PS1) 10Ottomata: Install spark3 via conda-analytics on all stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/834371 (https://phabricator.wikimedia.org/T312882) [19:49:19] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [19:49:58] (03CR) 10Ottomata: [C: 03+2] Install spark3 via conda-analytics on all stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/834371 (https://phabricator.wikimedia.org/T312882) (owner: 10Ottomata) [19:56:52] (03PS1) 10Bking: elastic: add instances and ports to motd [puppet] - 10https://gerrit.wikimedia.org/r/834372 (https://phabricator.wikimedia.org/T302530) [19:58:21] (03PS2) 10Bking: elastic: add instances and ports to 
motd [puppet] - 10https://gerrit.wikimedia.org/r/834372 (https://phabricator.wikimedia.org/T302530) [20:00:04] brennen and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220922T2000). [20:00:04] Tpt, Jhs, zabe, and arlolra: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:18] Hi! I'm around [20:00:23] o/ [20:00:40] here [20:00:57] (03CR) 10CI reject: [V: 04-1] elastic: add instances and ports to motd [puppet] - 10https://gerrit.wikimedia.org/r/834372 (https://phabricator.wikimedia.org/T302530) (owner: 10Bking) [20:01:57] o/ [20:02:13] Tpt: starting with yours [20:02:21] thank you! [20:03:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [extensions/ProofreadPage] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/833817 (https://phabricator.wikimedia.org/T318266) (owner: 10Tpt) [20:04:14] hey, i'm here [20:04:26] not too late for my patch, i hope :) [20:04:28] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37333/console" [puppet] - 10https://gerrit.wikimedia.org/r/834372 (https://phabricator.wikimedia.org/T302530) (owner: 10Bking) [20:04:49] we're still on the first one, so you're good. :) [20:04:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:05:00] cool. [20:05:22] 10SRE, 10ops-eqiad, 10DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (10Marostegui) Sure! 
I will power it off tomorrow in the EU morning and leave it off so you can change it anytime you like. [20:07:05] (03PS3) 10Bking: elastic: add instances and ports to motd [puppet] - 10https://gerrit.wikimedia.org/r/834372 (https://phabricator.wikimedia.org/T302530) [20:08:25] gonna go ahead and +2 Jhs's config patch to save a bit of time here. [20:08:35] thx [20:09:11] RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:09:13] i just got off work at my 2nd job (forgot i had a late shift today when i scheduled the patch), so i'm in my car and a mobile hotspot, lol [20:09:15] (03CR) 10Brennen Bearnes: [C: 03+2] Add logo icon and wordmark for nowikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/834059 (https://phabricator.wikimedia.org/T318341) (owner: 10Jon Harald Søby) [20:09:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:10:00] (03Merged) 10jenkins-bot: Add logo icon and wordmark for nowikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/834059 (https://phabricator.wikimedia.org/T318341) (owner: 10Jon Harald Søby) [20:10:57] (03PS4) 10Bking: elastic: add instances and ports to motd [puppet] - 10https://gerrit.wikimedia.org/r/834372 (https://phabricator.wikimedia.org/T302530) [20:11:47] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/834372 (https://phabricator.wikimedia.org/T302530) (owner: 10Bking) [20:13:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:14:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:14:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START 
helmfile.d/services/mwdebug: apply [20:15:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:15:54] (HelmReleaseBadStatus) firing: Helm release eventstreams-internal/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=eventstreams-internal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [20:17:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [20:18:09] (03Merged) 10jenkins-bot: Drops JS-side creation of "Source" link [extensions/ProofreadPage] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/833817 (https://phabricator.wikimedia.org/T318266) (owner: 10Tpt) [20:19:02] !log brennen@deploy1002 Started scap: Backport for [[gerrit:833817|Drops JS-side creation of "Source" link (T318266)]] [20:19:06] T318266: The proofreading progress indicator has a 0 size width on 1.40.0-wmf.2 - https://phabricator.wikimedia.org/T318266 [20:19:21] !log brennen@deploy1002 brennen and tpt: Backport for [[gerrit:833817|Drops JS-side creation of "Source" link (T318266)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:19:29] thank you! [20:19:49] Tpt, Jhs - both of those changes can be checked on mwdebug1001 [20:20:22] brennen, mine works as intended 👍 [20:20:47] brennen: My change works fine too! Thank you! 
[20:20:53] thx both, syncing [20:22:43] (03PS5) 10Bking: elastic: add instances and ports to motd [puppet] - 10https://gerrit.wikimedia.org/r/834372 (https://phabricator.wikimedia.org/T302530) [20:22:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [20:23:09] zabe: yours is up next for beta - want to go ahead with that read-only? [20:23:28] yes [20:24:48] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37335/console" [puppet] - 10https://gerrit.wikimedia.org/r/834372 (https://phabricator.wikimedia.org/T302530) (owner: 10Bking) [20:25:12] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:833817|Drops JS-side creation of "Source" link (T318266)]] (duration: 06m 09s) [20:25:17] T318266: The proofreading progress indicator has a 0 size width on 1.40.0-wmf.2 - https://phabricator.wikimedia.org/T318266 [20:25:17] zabe: ready when you are [20:25:18] brennen, I am ready [20:25:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:26:22] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/834058 (https://phabricator.wikimedia.org/T318126) (owner: 10Zabe) [20:26:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:26:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:27:03] (03Merged) 10jenkins-bot: beta: Promote deployment-db09 as master, decom deployment-db07 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/834058 (https://phabricator.wikimedia.org/T318126) (owner: 
10Zabe) [20:27:29] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+1] elastic: add instances and ports to motd [puppet] - 10https://gerrit.wikimedia.org/r/834372 (https://phabricator.wikimedia.org/T302530) (owner: 10Bking) [20:27:38] kicked beta-code-update-eqiad [20:27:38] (03CR) 10Bking: [C: 03+2] elastic: add instances and ports to motd [puppet] - 10https://gerrit.wikimedia.org/r/834372 (https://phabricator.wikimedia.org/T302530) (owner: 10Bking) [20:27:43] cool, thx [20:27:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:27:55] arlolra: yours next [20:28:08] ready [20:28:40] arlolra: cool if these both go out at once? [20:29:12] yes please [20:29:30] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [skins/MinervaNeue] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/834364 (https://phabricator.wikimedia.org/T305357) (owner: 10Arlolra) [20:29:32] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [skins/MinervaNeue] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/834366 (https://phabricator.wikimedia.org/T318300) (owner: 10Arlolra) [20:29:53] (03PS2) 10Vlad.shapik: Update the logic to run test coverage [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016) [20:30:58] (03PS1) 10Ryan Kemper: Revert "elastic: add instances and ports to motd" [puppet] - 10https://gerrit.wikimedia.org/r/833823 [20:31:05] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] Revert "elastic: add instances and ports to motd" [puppet] - 10https://gerrit.wikimedia.org/r/833823 (owner: 10Ryan Kemper) [20:32:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [skins/MinervaNeue] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/834364 (https://phabricator.wikimedia.org/T305357) (owner: 10Arlolra) [20:32:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using 
scap backport" [skins/MinervaNeue] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/834366 (https://phabricator.wikimedia.org/T318300) (owner: 10Arlolra) [20:32:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:33:59] (debugging some `scap backport` features so this might be a bit noisier than usual.) [20:34:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:34:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:34:07] (03PS1) 10Ryan Kemper: Revert "Revert "elastic: add instances and ports to motd"" [puppet] - 10https://gerrit.wikimedia.org/r/833824 [20:34:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [skins/MinervaNeue] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/834364 (https://phabricator.wikimedia.org/T305357) (owner: 10Arlolra) [20:34:22] (03PS2) 10Ryan Kemper: elastic: add instances and ports to motd [puppet] - 10https://gerrit.wikimedia.org/r/833824 (https://phabricator.wikimedia.org/T302530) [20:34:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:36:18] !log brennen@deploy1002 backport aborted: (duration: 02m 16s) [20:36:29] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [skins/MinervaNeue] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/834364 (https://phabricator.wikimedia.org/T305357) (owner: 10Arlolra) [20:36:31] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [skins/MinervaNeue] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/834366 (https://phabricator.wikimedia.org/T318300) (owner: 10Arlolra) [20:37:41] (03PS3) 10Ryan Kemper: elastic: add instances and ports to motd [puppet] - 10https://gerrit.wikimedia.org/r/833824 (https://phabricator.wikimedia.org/T302530) [20:39:34] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS 
(NOOP 1 DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37336/console" [puppet] - 10https://gerrit.wikimedia.org/r/833824 (https://phabricator.wikimedia.org/T302530) (owner: 10Ryan Kemper) [20:39:39] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/833824 (https://phabricator.wikimedia.org/T302530) (owner: 10Ryan Kemper) [20:41:03] (03CR) 10Bking: [V: 03+1] elastic: add instances and ports to motd [puppet] - 10https://gerrit.wikimedia.org/r/833824 (https://phabricator.wikimedia.org/T302530) (owner: 10Ryan Kemper) [20:41:19] (03CR) 10Bking: [V: 03+1 C: 03+1] elastic: add instances and ports to motd [puppet] - 10https://gerrit.wikimedia.org/r/833824 (https://phabricator.wikimedia.org/T302530) (owner: 10Ryan Kemper) [20:41:25] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] elastic: add instances and ports to motd [puppet] - 10https://gerrit.wikimedia.org/r/833824 (https://phabricator.wikimedia.org/T302530) (owner: 10Ryan Kemper) [20:47:21] (03Merged) 10jenkins-bot: Restrict figure to the size of the media [skins/MinervaNeue] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/834364 (https://phabricator.wikimedia.org/T305357) (owner: 10Arlolra) [20:47:24] (03Merged) 10jenkins-bot: Fix media alignment since disabling wgParserEnableLegacyMediaDOM [skins/MinervaNeue] (wmf/1.40.0-wmf.2) - 10https://gerrit.wikimedia.org/r/834366 (https://phabricator.wikimedia.org/T318300) (owner: 10Arlolra) [20:47:42] !log brennen@deploy1002 Started scap: Backport for [[gerrit:834364|Restrict figure to the size of the media (T305357 T318300)]], [[gerrit:834366|Fix media alignment since disabling wgParserEnableLegacyMediaDOM (T318300)]] [20:47:47] T305357: Long captions overflow their parent - https://phabricator.wikimedia.org/T305357 [20:47:47] T318300: Media alignment broken with MinervaNeue since disabling wgParserEnableLegacyMediaDOM - https://phabricator.wikimedia.org/T318300 [20:48:02] !log brennen@deploy1002 brennen and 
arlolra: Backport for [[gerrit:834364|Restrict figure to the size of the media (T305357 T318300)]], [[gerrit:834366|Fix media alignment since disabling wgParserEnableLegacyMediaDOM (T318300)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:48:37] arlolra: both should be testable [20:49:37] yes, looks great, thanks [20:49:44] cool, syncing [20:53:13] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:53:29] !log joal@deploy1002 Started deploy [airflow-dags/analytics@6c81e6f]: (no justification provided) [20:53:39] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@6c81e6f]: (no justification provided) (duration: 00m 10s) [20:54:16] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:834364|Restrict figure to the size of the media (T305357 T318300)]], [[gerrit:834366|Fix media alignment since disabling wgParserEnableLegacyMediaDOM (T318300)]] (duration: 06m 33s) [20:54:21] T305357: Long captions overflow their parent - https://phabricator.wikimedia.org/T305357 [20:54:21] T318300: Media alignment broken with MinervaNeue since disabling wgParserEnableLegacyMediaDOM - https://phabricator.wikimedia.org/T318300 [20:55:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:55:15] !log end of utc late backport & config window [20:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:19] thanks all. 
[20:56:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:56:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:56:31] thank you [20:56:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:21:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [21:22:15] 10SRE, 10Beta-Cluster-Infrastructure, 10Technical-Debt, 10Tracking-Neverending: Minimize infrastructure differences between Beta Cluster and production - https://phabricator.wikimedia.org/T87220 (10Zabe) [21:23:11] RECOVERY - SSH on db1098.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:23:24] !log dancy@deploy1002 backport aborted: (duration: 00m 05s) [21:23:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dancy@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833819 (owner: 10Ahmon Dancy) [21:24:33] (03Merged) 10jenkins-bot: Revert "InitialiseSettings-labs.php: Added test text (to be reverted)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/833819 (owner: 10Ahmon Dancy) [21:27:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:28:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:28:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:29:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:00:17] PROBLEM - SSH on analytics1076.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:19:52] !log joal@deploy1002 Started deploy [airflow-dags/analytics@901f810]: (no justification provided) [22:20:03] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@901f810]: (no justification provided) (duration: 00m 11s) [22:54:09] PROBLEM - SSH on restbase2012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:56:27] legoktm: ping [22:56:41] pong [22:58:20] legoktm: you said that when the webserver replies with a 429, that it has some kind of retry-after field correct? [22:58:33] There should be a header, yes [22:58:36] Can you provide documentation about this so I can implement support for it? [22:58:53] It's not like I can willingly 429 the web service. :-) [22:58:56] https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Retry-After [22:59:46] legoktm: thank you. Do you use delay seconds or HTTP date? [23:00:16] I've only seen delay seconds [23:00:24] Awesome. [23:00:35] That will make it easy to implement then. [23:01:25] https://gitlab.com/mwbot-rs/mwbot/-/blob/master/mwapi/src/client.rs#L478 is how I do it [23:01:35] RECOVERY - SSH on analytics1076.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:02:12] Nice. [23:02:22] Get header, pass to sleep(). 
:p [23:04:02] pretty much [23:13:23] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:25:24] (03PS1) 10Dduvall: aptrepo: add docker packages to thirdparty/ci for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/834398 (https://phabricator.wikimedia.org/T318382) [23:25:26] (03PS1) 10Dduvall: P:ci::docker: Install upstream docker packages for all CI agents [puppet] - 10https://gerrit.wikimedia.org/r/834399 (https://phabricator.wikimedia.org/T318382) [23:25:30] (03PS1) 10Dduvall: P:ci::docker: Upgrade docker to 20.10.18 on all CI agents [puppet] - 10https://gerrit.wikimedia.org/r/834400 (https://phabricator.wikimedia.org/T318382) [23:26:44] (03CR) 10CI reject: [V: 04-1] P:ci::docker: Install upstream docker packages for all CI agents [puppet] - 10https://gerrit.wikimedia.org/r/834399 (https://phabricator.wikimedia.org/T318382) (owner: 10Dduvall) [23:29:09] (03PS2) 10Dduvall: P:ci::docker: Install upstream docker packages for all CI agents [puppet] - 10https://gerrit.wikimedia.org/r/834399 (https://phabricator.wikimedia.org/T318382) [23:29:11] (03PS2) 10Dduvall: P:ci::docker: Upgrade docker to 20.10.18 on all CI agents [puppet] - 10https://gerrit.wikimedia.org/r/834400 (https://phabricator.wikimedia.org/T318382)
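Note on the 22:58-23:04 exchange above about 429 responses: "Get header, pass to sleep()" can be sketched roughly as below. This is a minimal Python sketch, not the mwbot-rs code linked in the log. The function name `retry_after_seconds` and the `default` fallback are hypothetical, introduced only for illustration. It accepts both forms described on the linked MDN page (delay seconds and HTTP date), although only delay seconds were reported as seen in practice.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: Optional[str],
                        now: Optional[datetime] = None,
                        default: float = 5.0) -> float:
    """Turn a Retry-After header value into a non-negative sleep duration.

    `default` is an assumed fallback used when the header is missing or
    unparseable; it is not taken from any server documentation.
    """
    if value is None:
        return default
    value = value.strip()

    # Form 1: delay seconds, e.g. "120"
    if value.isascii() and value.isdigit():
        return float(value)

    # Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError
        return default
    if when.tzinfo is None:
        # Treat a date without zone information as UTC
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", never a negative sleep
    return max(0.0, (when - now).total_seconds())
```

A client would then call something like `time.sleep(retry_after_seconds(headers.get("Retry-After")))` before retrying the request, where `headers` stands for whatever header mapping the HTTP library in use exposes.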