[00:23:55] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:53:17] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [00:55:25] (03PS1) 10BryanDavis: bullseye: add bzip2 and zstd compression programs [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/842992 (https://phabricator.wikimedia.org/T294607) [00:55:31] (03PS1) 10BryanDavis: mysql: new image for mysql backups [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/842993 (https://phabricator.wikimedia.org/T254636) [01:37:45] (JobUnavailable) firing: (6) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:15:15] (MjolnirUpdateFailureRateExceedesThreshold) firing: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [02:20:15] (MjolnirUpdateFailureRateExceedesThreshold) resolved: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [02:49:05] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [02:51:21] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [03:30:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:35:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:54:17] (03PS24) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 
(https://phabricator.wikimedia.org/T304040) [03:58:01] (03CR) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [03:58:29] (03CR) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [03:58:45] (03PS25) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) [04:18:25] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [04:48:01] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:49:32] (03PS1) 10Zabe: Add Apache configuration for vewikimedia [puppet] - 10https://gerrit.wikimedia.org/r/843001 (https://phabricator.wikimedia.org/T320890) [04:53:17] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [05:09:50] (03PS2) 10PleaseStand: Use OpenSSL for PBKDF2 password hashing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842522 [05:20:20] (03PS1) 10KartikMistry: Enable specialcontribute campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843003 
(https://phabricator.wikimedia.org/T319306) [05:22:17] (03CR) 10Santhosh: [C: 03+1] Enable specialcontribute campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843003 (https://phabricator.wikimedia.org/T319306) (owner: 10KartikMistry) [06:08:51] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 16276 [06:14:26] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 16276 [06:39:21] (03PS13) 10Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) [06:48:49] (03PS14) 10Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) [06:53:25] (03PS15) 10Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) [06:56:20] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37578/console" [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [07:00:05] Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221017T0700). [07:00:05] PleaseStand: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:15] (03PS16) 10Slyngshede: WIP: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) [07:00:20] Amir1 and Urbanecm: hi, I'm here [07:01:00] !log powercycle parse1002 - serial console's tty not responding, OEM events registered in `racadm getsel` [07:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:05] RECOVERY - Host parse1002 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [07:03:14] (03PS17) 10Slyngshede: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) [07:03:33] PROBLEM - Check systemd state on parse1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-phpfpm-statustext-textfile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:05:49] RECOVERY - Check systemd state on parse1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:06:38] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37579/console" [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [07:17:25] (03PS2) 10Elukey: admin_ng: set higher circuit breaking limits for EventGate on ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/842494 (https://phabricator.wikimedia.org/T320374) [07:18:02] (03CR) 10Majavah: [C: 03+2] perl532: Add libbytes-random-secure-perl [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/842991 (https://phabricator.wikimedia.org/T320824) (owner: 10BryanDavis) [07:19:23] (03Merged) 10jenkins-bot: perl532: Add libbytes-random-secure-perl [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/842991 (https://phabricator.wikimedia.org/T320824) (owner: 10BryanDavis) [07:20:42] 
hashar: Is there any issue with Docker in CI. See: https://integration.wikimedia.org/ci/job/cxserver-pipeline-test/170/console -- not sure what exactly is wrong. [07:24:39] (03CR) 10Elukey: [C: 03+2] admin_ng: set higher circuit breaking limits for EventGate on ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/842494 (https://phabricator.wikimedia.org/T320374) (owner: 10Elukey) [07:25:12] 10SRE, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), 10Traffic, and 3 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10AlexisJazz) Can someone do the thing again? It expired today. [07:25:31] hi PleaseStand, sorry, i missed the start of the window [07:25:35] did someone start deploying already? [07:25:59] urbanecm: no? [07:28:10] PleaseStand: okay. actually, in this case, i think it'd be great to coordinate the deployment closely with SREs, to ensure it doesn't result in an accident. changing PW hashing is potentially dangerous [07:29:35] urbanecm: Fine with me, no big hurry, should I file a Phabricator task and tag SRE? [07:29:51] PleaseStand: yeah, that's a great first step. 
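[editor's note] The backport being coordinated above (r842522, "Use OpenSSL for PBKDF2 password hashing") is risky for exactly the reason urbanecm hints at: the old and new PBKDF2 primitives must produce byte-identical derived keys, or every stored password hash stops verifying. A minimal sketch of that invariant in Python (an illustration only — MediaWiki itself is PHP, where the analogous pair is `hash_pbkdf2()` vs `openssl_pbkdf2()`; the sha512/30000/64 parameters are assumed from MediaWiki's defaults):

```python
import hashlib
import hmac

def derive(password: bytes, salt: bytes, rounds: int = 30000, dklen: int = 64) -> bytes:
    # PBKDF2-HMAC-SHA512; any replacement implementation must return
    # the exact same bytes for the same inputs.
    return hashlib.pbkdf2_hmac("sha512", password, salt, rounds, dklen)

def verify(password: bytes, salt: bytes, stored: bytes) -> bool:
    # Constant-time comparison, as any password check should use.
    return hmac.compare_digest(derive(password, salt), stored)

stored = derive(b"correct horse", b"0123456789abcdef")
assert verify(b"correct horse", b"0123456789abcdef", stored)
assert not verify(b"wrong password", b"0123456789abcdef", stored)
```

If the swapped-in primitive diverged even in one byte, `verify` would start rejecting every existing hash — which is why deploying this with an SRE watching is the right call.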
[07:30:32] (03CR) 10Urbanecm: [C: 03+2] logos: Fix bug when a variant param is specific [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842923 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang) [07:31:14] (03CR) 10Urbanecm: [C: 03+2] Mentee filters: always use mw.user.options values to initialise the mentees store [extensions/GrowthExperiments] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/842897 (https://phabricator.wikimedia.org/T320728) (owner: 10Urbanecm) [07:31:52] (03Merged) 10jenkins-bot: logos: Fix bug when a variant param is specific [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842923 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang) [07:32:51] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/842897 (https://phabricator.wikimedia.org/T320728) (owner: 10Urbanecm) [07:34:02] !log set thanos ring replicas to 3.60 T311690 [07:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:07] T311690: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690 [07:37:31] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [07:37:44] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [07:37:53] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [07:38:07] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [07:38:20] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [07:38:25] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[07:41:05] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Update makevm to include completion of the installation with the puppet runs - https://phabricator.wikimedia.org/T306661 (10ayounsi) [07:42:07] (03PS3) 10Majavah: api: support sha256 checksums [software/acme-chief] - 10https://gerrit.wikimedia.org/r/806939 [07:42:28] (03CR) 10Majavah: api: support sha256 checksums (032 comments) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/806939 (owner: 10Majavah) [07:43:22] (03PS3) 10Majavah: api: Offer JSON for metadata if requested [software/acme-chief] - 10https://gerrit.wikimedia.org/r/806940 [07:45:22] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/842850 (owner: 10Jbond) [07:47:27] (03PS1) 10Jcrespo: backup: Increase the maximum amount of volumes of esrw (backup*003) [puppet] - 10https://gerrit.wikimedia.org/r/843407 (https://phabricator.wikimedia.org/T313582) [07:48:22] (03CR) 10DCausse: [C: 03+1] Added structured data team [puppet] - 10https://gerrit.wikimedia.org/r/842418 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle) [07:48:47] (03PS2) 10Jcrespo: backup: Increase the maximum amount of volumes of esrw (backup*003) [puppet] - 10https://gerrit.wikimedia.org/r/843407 (https://phabricator.wikimedia.org/T313582) [07:50:04] (03Merged) 10jenkins-bot: Mentee filters: always use mw.user.options values to initialise the mentees store [extensions/GrowthExperiments] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/842897 (https://phabricator.wikimedia.org/T320728) (owner: 10Urbanecm) [07:50:22] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:842897|Mentee filters: always use mw.user.options values to initialise the mentees store (T320728)]] [07:50:28] T320728: Mentee overview(vue): Empty string in "Maximum"/"Minimum" filter options is not persisted - https://phabricator.wikimedia.org/T320728 [07:50:48] !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:842897|Mentee filters: 
always use mw.user.options values to initialise the mentees store (T320728)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [07:54:07] (03CR) 10Jcrespo: [C: 03+2] backup: Increase the maximum amount of volumes of esrw (backup*003) [puppet] - 10https://gerrit.wikimedia.org/r/843407 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [07:55:24] 10SRE, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), 10Traffic, and 3 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10Joe) p:05Medium→03High Please #traffic take a look at this proble... [07:57:44] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:842897|Mentee filters: always use mw.user.options values to initialise the mentees store (T320728)]] (duration: 07m 22s) [07:57:49] T320728: Mentee overview(vue): Empty string in "Maximum"/"Minimum" filter options is not persisted - https://phabricator.wikimedia.org/T320728 [07:57:52] * urbanecm done [08:02:26] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] debian: add packaging [debs/benthos] - 10https://gerrit.wikimedia.org/r/842808 (owner: 10Filippo Giunchedi) [08:03:33] !log restarting several bacula-related daemons to update its configuration [08:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:07] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:18:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [08:22:34] (03CR) 10Filippo Giunchedi: [C: 03+2] Added structured data team [puppet] - 
10https://gerrit.wikimedia.org/r/842418 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle) [08:31:44] 10SRE, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), 10Traffic, and 3 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10Vgutierrez) The current deployment-prep instances are pretty far from... [08:34:13] 10SRE: Deploy MediaWiki config change to use OpenSSL for PBKDF2 password hashing - https://phabricator.wikimedia.org/T320929 (10PleaseStand) [08:34:27] (03PS3) 10PleaseStand: Use OpenSSL for PBKDF2 password hashing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842522 (https://phabricator.wikimedia.org/T320929) [08:34:57] urbanecm: thanks, I filed https://phabricator.wikimedia.org/T320929 [08:35:47] 10SRE, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), 10Traffic, and 3 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10Vgutierrez) I've created T320930 to track this [08:49:39] 10SRE, 10observability, 10serviceops, 10Maps (Kartotherian): Get Kartotherian SLO metrics into Prometheus - https://phabricator.wikimedia.org/T320748 (10hnowlan) I'd be curious to hear @Jgiannelos's input on this one - if we want to not bother rewriting Kartotherian to speak to Prometheus directly via the... 
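[editor's note] The T320748 thread above weighs rewriting Kartotherian to expose Prometheus metrics directly against bridging its existing statsd output. One common bridge for service-runner-era services (named here as an option under discussion, not the approach the task settled on) is Prometheus's statsd_exporter with a mapping config, roughly:

```yaml
# statsd_exporter mapping sketch: translate dotted statsd metric names
# into labelled Prometheus metrics. The metric names are illustrative,
# not Kartotherian's actual ones.
mappings:
  - match: "kartotherian.*.tile_render_duration"
    name: "kartotherian_tile_render_duration_seconds"
    labels:
      layer: "$1"
```

Each `*` in `match` becomes a positional capture (`$1`, `$2`, …) usable as a label value, which is what turns a flat statsd namespace into SLO-queryable labelled series.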
[08:50:31] (03PS1) 10Ayounsi: Move all eqiad VRRP mastership to cr2 [homer/public] - 10https://gerrit.wikimedia.org/r/843414 (https://phabricator.wikimedia.org/T320566) [08:52:59] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:53:06] (03CR) 10Vgutierrez: [C: 03+2] api: support sha256 checksums [software/acme-chief] - 10https://gerrit.wikimedia.org/r/806939 (owner: 10Majavah) [08:53:17] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [08:54:45] (03CR) 10Ayounsi: [C: 03+2] Move all eqiad VRRP mastership to cr2 [homer/public] - 10https://gerrit.wikimedia.org/r/843414 (https://phabricator.wikimedia.org/T320566) (owner: 10Ayounsi) [08:55:14] !log Move all eqiad VRRP mastership to cr2 - T320566 [08:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:19] T320566: Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 [08:57:21] (03CR) 10Vgutierrez: "looking good, could you supply a test for this CR as well? Thanks!" [software/acme-chief] - 10https://gerrit.wikimedia.org/r/806940 (owner: 10Majavah) [08:58:18] 10SRE, 10Data Engineering Planning, 10serviceops, 10Event-Platform Value Stream (Sprint 02), 10Patch-For-Review: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (10Clement_Goubert) In preparation of the redeploy, I lowered the TTL for service discovery to 30 second... 
[08:59:48] (03Merged) 10jenkins-bot: api: support sha256 checksums [software/acme-chief] - 10https://gerrit.wikimedia.org/r/806939 (owner: 10Majavah) [09:00:14] 10SRE, 10observability, 10serviceops, 10Maps (Kartotherian): Get Kartotherian SLO metrics into Prometheus - https://phabricator.wikimedia.org/T320748 (10Jgiannelos) The effort required to configure service runner to migrate from statsd to prometheus is not that much (its abstracted so its a matter of confi... [09:04:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:05:27] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [09:09:04] !log de-pref cr1-eqiad wavelength transports (to codfw and drmrs) - T320566 [09:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:09] T320566: Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 [09:09:13] RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:09:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:12:01] (03PS1) 10Ayounsi: Drain eqiad-drmrs GTT link [homer/public] - 10https://gerrit.wikimedia.org/r/843416 (https://phabricator.wikimedia.org/T320566) [09:12:31] (03PS4) 10MdsShakil: Enable Sandbox Extension at Bengali Wikiquote 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/842967 (https://phabricator.wikimedia.org/T320903) [09:16:42] (03PS2) 10Ayounsi: Drain eqiad-drmrs GTT link [homer/public] - 10https://gerrit.wikimedia.org/r/843416 (https://phabricator.wikimedia.org/T320566) [09:21:28] _joe_: Did you have a chance to look at the maintenance script parameter patch, yet? https://gerrit.wikimedia.org/r/c/operations/puppet/+/841148 [09:21:47] <_joe_> hoo: look, yes, and it lgtm [09:22:02] <_joe_> I could not merge it on friday, I plan to merge it today [09:23:03] Great, thanks :) [09:24:12] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Cloud-VPS (Quota-requests): Request increased quota for deployment-prep Cloud VPS project - https://phabricator.wikimedia.org/T320932 (10Vgutierrez) [09:24:18] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10Ladsgroup) I have depooled this. I'm tiny bit confused, the state is degraded (according to megacli -LDPDInfo -aAll) but the errors are all zero and state all are working: ` root@db1202:~#... [09:24:31] !log de-pref eqiad-drmrs GTT VPLS (latency between eqiad and drmrs will increase) - T320566 [09:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:36] T320566: Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 [09:24:45] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Cloud-VPS (Quota-requests): Request increased quota for deployment-prep Cloud VPS project - https://phabricator.wikimedia.org/T320932 (10dcaro) a:03aborrero +1 [09:25:04] cc vgutierrez ^ [09:25:19] ack [09:30:42] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10Ladsgroup) ` root@db1202:~# sudo /usr/local/lib/nagios/plugins/get-raid-status-perccli communication 0 OK | controller: 1 Needs Attention | physical_disk: 0 OK | virtual_disk: 1 Dgrd | bbu:... 
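[editor's note] The confusion on T320786 above — overall state `Dgrd` while every per-disk error counter reads zero — is easier to see if you split the plugin's pipe-separated summary into components. A toy parser (format inferred from the paste above, with colons normalized; this is not the plugin's own code):

```python
def failing_components(summary: str) -> dict[str, str]:
    # Return only the components whose status is not "... OK",
    # e.g. the controller and virtual disk in the db1202 paste.
    bad = {}
    for field in summary.split("|"):
        name, _, status = field.strip().partition(":")
        status = status.strip()
        if status and not status.endswith("OK"):
            bad[name.strip()] = status
    return bad

line = ("communication: 0 OK | controller: 1 Needs Attention | "
        "physical_disk: 0 OK | virtual_disk: 1 Dgrd | bbu: 0 OK")
print(failing_components(line))
# -> {'controller': '1 Needs Attention', 'virtual_disk': '1 Dgrd'}
```

This matches what the log shows: the RAID volume itself is degraded (`virtual_disk: 1 Dgrd`) even though no individual member disk is reporting media errors yet.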
[09:35:33] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:36:50] Emperor: ^^ [09:38:37] (03CR) 10JMeybohm: [C: 04-1] "Thanks for taking this!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/842819 (https://phabricator.wikimedia.org/T316348) (owner: 10Clément Goubert) [09:42:40] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic: deployment-cache instances are missing several major features available in production - https://phabricator.wikimedia.org/T320930 (10aborrero) [09:42:54] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Cloud-VPS (Quota-requests): Request increased quota for deployment-prep Cloud VPS project - https://phabricator.wikimedia.org/T320932 (10aborrero) 05Open→03Resolved [09:46:00] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Cloud-VPS (Quota-requests): Request increased quota for deployment-prep Cloud VPS project - https://phabricator.wikimedia.org/T320932 (10Vgutierrez) Thanks @aborrero && @dcaro [09:46:51] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:47:33] (PuppetFailure) firing: Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:47:52] (03PS2) 10Clément Goubert: Remove references to deprecated kubeyaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/842819 (https://phabricator.wikimedia.org/T316348) [09:48:05] (03CR) 10Clément Goubert: Remove references to deprecated kubeyaml (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/842819 (https://phabricator.wikimedia.org/T316348) (owner: 10Clément Goubert) [09:48:39] !log 
powercycle db1202 [09:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:54] godog: the alert hosts puppet failure could be related to your earlier merge for the structured-data team addition? [09:51:11] amtool: error: failed to validate 1 file(s) [09:51:11] Checking '/etc/prometheus/alertmanager.yml20221017-21001-qa8d4e' FAILED: yaml: line 158: did not find expected key [09:51:20] volans: oh yeah totally, thank you I missed it [09:52:03] 10SRE: admin: Add validation checks for missing realname and email in data.yaml - https://phabricator.wikimedia.org/T320937 (10PleaseStand) [09:52:33] (PuppetFailure) firing: (2) Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:53:12] (03PS1) 10Filippo Giunchedi: alertmanager: fix yaml for structured-data AM router [puppet] - 10https://gerrit.wikimedia.org/r/843420 (https://phabricator.wikimedia.org/T312235) [09:53:59] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10jcrespo) [09:54:15] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: fix yaml for structured-data AM router [puppet] - 10https://gerrit.wikimedia.org/r/843420 (https://phabricator.wikimedia.org/T312235) (owner: 10Filippo Giunchedi) [09:58:33] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10jcrespo) `lines=10 -------------------------------------------------------------------------------- SeqNumber = 323... 
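[editor's note] The amtool failure above (`yaml: line 158: did not find expected key`) is the error YAML parsers typically emit when a mapping key is indented to the wrong level — here introduced by the structured-data routing addition from r842418 and fixed in r843420/r843423. An illustrative fragment of the shape such a route needs (receiver and matcher names are hypothetical; the real alertmanager.yml is managed by puppet):

```yaml
# Each child route must sit one level under its parent's "routes:" key;
# outdenting "receiver" or "match" by one level yields
# "did not find expected key" at that line.
route:
  receiver: default
  routes:
    - match:
        team: structured-data
      receiver: structured-data-email
```

Validating with `amtool check-config` (as the puppet-run output above does) before the file reaches the Alertmanager hosts is what turned this into a quick two-patch fix rather than an alerting outage.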
[09:59:54] (03PS4) 10Majavah: api: Offer JSON for metadata if requested [software/acme-chief] - 10https://gerrit.wikimedia.org/r/806940 [10:06:05] (03CR) 10CI reject: [V: 04-1] api: Offer JSON for metadata if requested [software/acme-chief] - 10https://gerrit.wikimedia.org/r/806940 (owner: 10Majavah) [10:06:15] (03PS3) 10PleaseStand: admin: Clean up duplication in schema.yaml [puppet] - 10https://gerrit.wikimedia.org/r/820891 (https://phabricator.wikimedia.org/T320937) [10:06:19] (03PS5) 10PleaseStand: admin: Add realname, email existence constraints to schema.yaml [puppet] - 10https://gerrit.wikimedia.org/r/820862 (https://phabricator.wikimedia.org/T320937) [10:08:12] (03PS5) 10Majavah: api: Offer JSON for metadata if requested [software/acme-chief] - 10https://gerrit.wikimedia.org/r/806940 [10:16:17] (03PS1) 10Filippo Giunchedi: alertmanager: fix #2 for structured-data route [puppet] - 10https://gerrit.wikimedia.org/r/843423 [10:17:08] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: fix #2 for structured-data route [puppet] - 10https://gerrit.wikimedia.org/r/843423 (owner: 10Filippo Giunchedi) [10:22:33] (PuppetFailure) firing: (2) Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:27:42] !log disable cr1-eqiad:ae4 for recabling and troubleshooting - T320566 [10:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:46] T320566: Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 [10:29:27] (PuppetFailure) resolved: (2) Puppet has failed on alerting hosts - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:34:14] (03PS1) 10Clément Goubert: mwdebug: Disable nutcracker [deployment-charts] - 
10https://gerrit.wikimedia.org/r/843425 [10:39:01] PROBLEM - VRRP status on cr1-eqiad is CRITICAL: VRRP CRITICAL - 3 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [10:41:05] 10SRE, 10observability, 10serviceops, 10Maps (Kartotherian): Get Kartotherian SLO metrics into Prometheus - https://phabricator.wikimedia.org/T320748 (10hnowlan) I hadn't considered how we get traffic to Kartotherian - for the most part we just directly rewrite requests for maps.wikimedia.org to kartotheri... [10:42:43] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:42:48] XioNoX: FYI VRRP status on cr1-eqiad ^^^ [10:44:26] volans: yeah, we're troubleshooting cr1<->rowD issue [10:44:30] thx for the ping though [10:45:08] ack, yes I knew there was some WIP but not 100% sure if relevant ;) [10:49:03] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic: deployment-cache instances are missing several major features available in production - https://phabricator.wikimedia.org/T320930 (10Vgutierrez) 05Stalled→03In progress Spawning deployment-cache-text07 && deployment-cache-upload07... 
[10:49:41] (03PS11) 10Btullis: Add a new production images for spark and spark-operator [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730)
[10:51:27] (03PS12) 10Btullis: Add a new production images for spark and spark-operator [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730)
[10:55:27] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:59:09] PROBLEM - SSH on mw1319.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:09:43] !log shutting down BGP sessions from cr1-eqiad to lsw1-e1-eqiad in advance of linecard reboot
[11:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:02] (03PS1) 10Vgutierrez: cache::haproxy: Allow disabling monitoring [puppet] - 10https://gerrit.wikimedia.org/r/843470 (https://phabricator.wikimedia.org/T320930)
[11:11:15] !log cr1-eqiad> request chassis fpc slot 1 offline - T320566
[11:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:19] T320566: Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566
[11:12:26] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37580/console" [puppet] - 10https://gerrit.wikimedia.org/r/843470 (https://phabricator.wikimedia.org/T320930) (owner: 10Vgutierrez)
[11:13:02] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache::haproxy: Allow disabling monitoring [puppet] - 10https://gerrit.wikimedia.org/r/843470 (https://phabricator.wikimedia.org/T320930) (owner: 10Vgutierrez)
[11:15:47] PROBLEM - BGP status on lsw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn, AS14907/IPv6: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:28:45] 10SRE-swift-storage, 10Community-Tech, 10MW-1.40-notes (1.40.0-wmf.6; 2022-10-17): Establish Phonos production storage requirements - https://phabricator.wikimedia.org/T320675 (10LSobanski)
[11:38:56] !log moving et-1/1/3 out of ae bundle on cr1-eqiad
[11:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:51] 10SRE-swift-storage: IPv6 records inconsistent on the ms-be hosts - https://phabricator.wikimedia.org/T320947 (10LSobanski)
[11:41:32] !log moving port et-2/0/49 out of ae1 bundle asw2-d-eqiad
[11:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:54] 10SRE-swift-storage, 10Data-Engineering, 10Privacy Engineering: Swift for differential privacy data publication - https://phabricator.wikimedia.org/T307245 (10LSobanski)
[11:43:53] 10SRE, 10SRE-swift-storage, 10Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10LSobanski)
[11:43:57] (03PS1) 10Sergio Gimeno: GrowthExperiments: enable the vue version of mentee overview in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843481 (https://phabricator.wikimedia.org/T300532)
[11:44:37] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:44:52] ^^ this is due to work me and arzhel doing
[11:45:45] PROBLEM - SSH on wdqs2005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:00:13] RECOVERY - SSH on mw1319.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:06:14] (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843481 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno)
[12:10:05] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:10:27] RECOVERY - VRRP status on cr1-eqiad is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[12:10:47] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:19:27] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[12:23:37] (03PS1) 10Stang: Fix broken wordmark/tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843483 (https://phabricator.wikimedia.org/T320944)
[12:27:54] (03PS1) 10Stang: logos: Set a higher precision for scour [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843484 (https://phabricator.wikimedia.org/T307705)
[12:27:59] 10SRE, 10Data Engineering Planning, 10serviceops, 10Event-Platform Value Stream (Sprint 03), 10Patch-For-Review: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (10lbowmaker)
[12:32:59] (03PS2) 10Stang: logos: Set a higher precision for scour [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843484 (https://phabricator.wikimedia.org/T307705)
[12:36:54] !log re-enable BGP between cr1 and lsw1-e1 - T320566
[12:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:59] T320566: Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566
[12:37:45] RECOVERY - BGP status on lsw1-e1-eqiad.mgmt is OK: BGP OK - up: 9, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:45:09] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:54:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[12:54:30] (03PS1) 10Kosta Harlan: GrowthExperiments: Enable link recommendation for 5th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843487 (https://phabricator.wikimedia.org/T304549)
[12:55:41] (03PS1) 10Volans: sre.discovery.service-route: fix argument parsing [cookbooks] - 10https://gerrit.wikimedia.org/r/843488
[12:55:43] (03PS1) 10Volans: sre.hadoop.reboot-workers: remove unused __title__ [cookbooks] - 10https://gerrit.wikimedia.org/r/843489
[12:56:21] (03CR) 10Clément Goubert: [C: 03+1] sre.discovery.service-route: fix argument parsing [cookbooks] - 10https://gerrit.wikimedia.org/r/843488 (owner: 10Volans)
[12:56:54] (03CR) 10Volans: [C: 03+2] "trivial cleanup, self-merging" [cookbooks] - 10https://gerrit.wikimedia.org/r/843489 (owner: 10Volans)
[12:57:02] (03CR) 10Volans: [C: 03+2] sre.discovery.service-route: fix argument parsing [cookbooks] - 10https://gerrit.wikimedia.org/r/843488 (owner: 10Volans)
[13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: Dear deployers, time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221017T1300).
[13:00:05] koi, MdsShakil, sergi0, and kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:14] o/
[13:00:15] hi
[13:00:20] hallo
[13:00:23] i can deploy today!
[13:00:52] (03Merged) 10jenkins-bot: sre.discovery.service-route: fix argument parsing [cookbooks] - 10https://gerrit.wikimedia.org/r/843488 (owner: 10Volans)
[13:01:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2159.codfw.wmnet with reason: Maintenance
[13:01:26] (03PS2) 10Urbanecm: frwiktionary: Upload the correct tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842916 (https://phabricator.wikimedia.org/T320840) (owner: 10Stang)
[13:01:29] (03CR) 10Urbanecm: [C: 03+2] frwiktionary: Upload the correct tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842916 (https://phabricator.wikimedia.org/T320840) (owner: 10Stang)
[13:01:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2159.codfw.wmnet with reason: Maintenance
[13:01:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance
[13:01:39] thanks urbanecm
[13:01:42] (03Merged) 10jenkins-bot: sre.hadoop.reboot-workers: remove unused __title__ [cookbooks] - 10https://gerrit.wikimedia.org/r/843489 (owner: 10Volans)
[13:01:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance
[13:01:50] (03PS2) 10Urbanecm: Fix broken wordmark/tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843483 (https://phabricator.wikimedia.org/T320944) (owner: 10Stang)
[13:01:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T318950)', diff saved to https://phabricator.wikimedia.org/P35500 and previous config saved to /var/cache/conftool/dbconfig/20221017-130154-ladsgroup.json
[13:01:56] (03CR) 10Urbanecm: [C: 03+2] Fix broken wordmark/tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843483 (https://phabricator.wikimedia.org/T320944) (owner: 10Stang)
[13:01:59] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[13:02:12] (03Merged) 10jenkins-bot: frwiktionary: Upload the correct tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842916 (https://phabricator.wikimedia.org/T320840) (owner: 10Stang)
[13:02:45] (03Merged) 10jenkins-bot: Fix broken wordmark/tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843483 (https://phabricator.wikimedia.org/T320944) (owner: 10Stang)
[13:03:03] kostajh: your first two patches are at mwdebug1001, please check!
[13:03:38] urbanecm: I just have one patch, did you mean to tag koi ?
[13:03:42] eh, yes
[13:03:44] koi: ^^
[13:03:46] sorry
[13:03:52] looking
[13:03:54] * urbanecm should stop using k + tab
[13:04:08] (03PS2) 10Urbanecm: dewiktionary: Add new logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842986 (https://phabricator.wikimedia.org/T320891) (owner: 10Stang)
[13:04:12] (03CR) 10Urbanecm: [C: 03+2] dewiktionary: Add new logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842986 (https://phabricator.wikimedia.org/T320891) (owner: 10Stang)
[13:04:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T318950)', diff saved to https://phabricator.wikimedia.org/P35501 and previous config saved to /var/cache/conftool/dbconfig/20221017-130412-ladsgroup.json
[13:04:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2105.codfw.wmnet with reason: Maintenance
[13:04:17] !log ladsgroup@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 8:00:00 on db2105.codfw.wmnet with reason: Maintenance
[13:04:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2105.codfw.wmnet with reason: Maintenance
[13:04:23] (03PS2) 10Urbanecm: dewiktionary: Update logos defined in logos.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842987 (https://phabricator.wikimedia.org/T320891) (owner: 10Stang)
[13:04:25] (03CR) 10Urbanecm: [C: 03+2] dewiktionary: Update logos defined in logos.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842987 (https://phabricator.wikimedia.org/T320891) (owner: 10Stang)
[13:04:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2105.codfw.wmnet with reason: Maintenance
[13:04:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T318955)', diff saved to https://phabricator.wikimedia.org/P35502 and previous config saved to /var/cache/conftool/dbconfig/20221017-130440-ladsgroup.json
[13:04:45] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[13:04:58] (03Merged) 10jenkins-bot: dewiktionary: Add new logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842986 (https://phabricator.wikimedia.org/T320891) (owner: 10Stang)
[13:05:10] MdsShakil: hi, your commit will soon be deployed, are you around?
[13:05:10] (03Merged) 10jenkins-bot: dewiktionary: Update logos defined in logos.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842987 (https://phabricator.wikimedia.org/T320891) (owner: 10Stang)
[13:05:30] Yes
[13:05:30] urbanecm: I tested on all those sites, and LGTM
[13:05:35] great, syncing!
[13:05:39] MdsShakil: okay, i'll ping you when ready
[13:05:40] !log urbanecm@deploy1002 Started scap: b434c5a84: 9d10a60ea: Wordmark changes (T320944, T320840)
[13:05:46] T320840: Vector 2022: Wrong tagline for other site displayed under logo - https://phabricator.wikimedia.org/T320840
[13:05:47] T320944: Taglline of German Wikipedia on vector-2022 broken - https://phabricator.wikimedia.org/T320944
[13:05:47] (03PS2) 10Kosta Harlan: GrowthExperiments: Enable link recommendation for 5th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843487 (https://phabricator.wikimedia.org/T304549)
[13:06:06] !log root@cumin1001 START - Cookbook sre.discovery.service-route
[13:06:08] !log root@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
[13:07:01] hi urbanecm, would you like to give a +2 to this patch https://gerrit.wikimedia.org/r/843484, maybe at the end of this window
[13:07:06] (03PS1) 10Vgutierrez: varnish::frontend: Allow disable monitoring [puppet] - 10https://gerrit.wikimedia.org/r/843491 (https://phabricator.wikimedia.org/T320930)
[13:07:08] found another bug 0 0
[13:07:10] urbanecm: I'll want to verify my config patch by running a maintenance script on mwmaint, will that work?
[13:07:26] (03PS1) 10Ottomata: Eventlogging - Stop refining decomissioned EditConflict events [puppet] - 10https://gerrit.wikimedia.org/r/843492 (https://phabricator.wikimedia.org/T318258)
[13:07:49] kostajh: yes, so long you run `scap pull` at mwmaint before running your script
[13:08:24] (03PS3) 10Urbanecm: logos: Set a higher precision for scour [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843484 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang)
[13:08:27] (03CR) 10Urbanecm: [C: 03+2] logos: Set a higher precision for scour [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843484 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang)
[13:08:54] koi: +2'ed now, thanks for the fix.
[13:08:57] thanks!
[13:09:09] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37581/console" [puppet] - 10https://gerrit.wikimedia.org/r/843491 (https://phabricator.wikimedia.org/T320930) (owner: 10Vgutierrez)
[13:09:13] (03Merged) 10jenkins-bot: logos: Set a higher precision for scour [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843484 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang)
[13:09:38] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] varnish::frontend: Allow disable monitoring [puppet] - 10https://gerrit.wikimedia.org/r/843491 (https://phabricator.wikimedia.org/T320930) (owner: 10Vgutierrez)
[13:10:12] !log urbanecm@deploy1002 Finished scap: b434c5a84: 9d10a60ea: Wordmark changes (T320944, T320840) (duration: 04m 32s)
[13:11:34] koi: your other three patches are now at mwdebug1001, please test
[13:11:38] (03PS3) 10Kosta Harlan: GrowthExperiments: Enable link recommendation for 5th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843487 (https://phabricator.wikimedia.org/T304549)
[13:11:46] looking
[13:11:49] (03PS5) 10Urbanecm: Enable Sandbox Extension at Bengali Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842967 (https://phabricator.wikimedia.org/T320903) (owner: 10MdsShakil)
[13:11:55] (03CR) 10Urbanecm: [C: 03+2] Enable Sandbox Extension at Bengali Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842967 (https://phabricator.wikimedia.org/T320903) (owner: 10MdsShakil)
[13:12:19] (03CR) 10Ottomata: [C: 03+2] Eventlogging - Stop refining decomissioned EditConflict events [puppet] - 10https://gerrit.wikimedia.org/r/843492 (https://phabricator.wikimedia.org/T318258) (owner: 10Ottomata)
[13:12:41] (03Merged) 10jenkins-bot: Enable Sandbox Extension at Bengali Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842967 (https://phabricator.wikimedia.org/T320903) (owner: 10MdsShakil)
[13:12:55] urbanecm: the new logo for dewiktionary LGTM
[13:13:11] !log urbanecm@deploy1002 Started scap: 52821e09c: 35000a4b: dewiktionary: Update logo (T320891)
[13:13:13] great, syncing
[13:13:16] T320891: Requesting logo change for de.wiktionary.org - https://phabricator.wikimedia.org/T320891
[13:13:17] (probably the 1x logo should be purged after scap
[13:13:25] yup yup
[13:17:14] !log urbanecm@deploy1002 Finished scap: 52821e09c: 35000a4b: dewiktionary: Update logo (T320891) (duration: 04m 03s)
[13:17:18] done, purged
[13:17:27] koi: all your patches should now be synced
[13:17:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842967 (https://phabricator.wikimedia.org/T320903) (owner: 10MdsShakil)
[13:17:45] thanks a lot!
[13:17:51] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:842967|Enable Sandbox Extension at Bengali Wikiquote (T320903)]]
[13:17:56] T320903: Enable Sandbox Extension at Bengali Wikiquote - https://phabricator.wikimedia.org/T320903
[13:18:11] !log urbanecm@deploy1002 urbanecm and mdsshakil: Backport for [[gerrit:842967|Enable Sandbox Extension at Bengali Wikiquote (T320903)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[13:18:17] 10SRE, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), 10Traffic, and 3 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10taavi)
[13:18:22] MdsShakil: your patch is at mwdebug1001, can you check?
[13:18:32] (03PS2) 10Urbanecm: GrowthExperiments: enable the vue version of mentee overview in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843481 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno)
[13:18:40] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: enable the vue version of mentee overview in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843481 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno)
[13:18:50] urbanecm: looking good to me
[13:18:56] great, syncing
[13:18:58] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Patch-For-Review: deployment-cache instances are missing several major features available in production - https://phabricator.wikimedia.org/T320930 (10Vgutierrez) deployment-cache-text07 is up & running: ` vgutierrez@deployment-cache-text07:~$ curl --conne...
[13:19:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T318955)', diff saved to https://phabricator.wikimedia.org/P35503 and previous config saved to /var/cache/conftool/dbconfig/20221017-131911-ladsgroup.json
[13:19:16] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[13:19:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P35504 and previous config saved to /var/cache/conftool/dbconfig/20221017-131918-ladsgroup.json
[13:19:32] (03Merged) 10jenkins-bot: GrowthExperiments: enable the vue version of mentee overview in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843481 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno)
[13:19:41] PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:19:59] 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10Volans) p:05Triage→03Medium
[13:22:46] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:842967|Enable Sandbox Extension at Bengali Wikiquote (T320903)]] (duration: 04m 54s)
[13:22:51] MdsShakil: and should be live now!
[13:23:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843481 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno)
[13:23:11] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:843481|GrowthExperiments: enable the vue version of mentee overview in all wikis (T300532)]]
[13:23:15] T300532: Migration of mentee overview to Vue - https://phabricator.wikimedia.org/T300532
[13:23:30] !log urbanecm@deploy1002 urbanecm and sgimeno: Backport for [[gerrit:843481|GrowthExperiments: enable the vue version of mentee overview in all wikis (T300532)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet
[13:23:33] sergi0_: your patch is now at mwdebug1001, can you check please?
[13:23:47] checking
[13:23:49] urbanecm: Thanks
[13:24:19] no problem MdsShakil
[13:25:14] looking good to me in enwiki
[13:25:41] yep, lgtm too, so, let's sync?
[13:25:50] yes
[13:25:53] doing
[13:26:00] (03PS4) 10Urbanecm: GrowthExperiments: Enable link recommendation for 5th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843487 (https://phabricator.wikimedia.org/T304549) (owner: 10Kosta Harlan)
[13:26:02] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: Enable link recommendation for 5th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843487 (https://phabricator.wikimedia.org/T304549) (owner: 10Kosta Harlan)
[13:27:21] (03Merged) 10jenkins-bot: GrowthExperiments: Enable link recommendation for 5th round wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843487 (https://phabricator.wikimedia.org/T304549) (owner: 10Kosta Harlan)
[13:27:48] !log Depooling eventgate-logging-external in codfw - T303543
[13:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:53] T303543: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543
[13:28:17] !log cgoubert@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=eventgate-logging-external,name=codfw
[13:29:35] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:843481|GrowthExperiments: enable the vue version of mentee overview in all wikis (T300532)]] (duration: 06m 24s)
[13:29:40] T300532: Migration of mentee overview to Vue - https://phabricator.wikimedia.org/T300532
[13:29:42] sergi0_: should be live now!
[13:29:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843487 (https://phabricator.wikimedia.org/T304549) (owner: 10Kosta Harlan)
[13:30:02] kostajh: your patch's going now...
[13:30:09] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:843487|GrowthExperiments: Enable link recommendation for 5th round wikis (T304549)]]
[13:30:18] urbanecm: it is indeed. Thanks a lot!
[13:30:22] no problem!
[13:30:23] !log urbanecm@deploy1002 urbanecm and kharlan: Backport for [[gerrit:843487|GrowthExperiments: Enable link recommendation for 5th round wikis (T304549)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[13:30:28] congrats to finishing the Vue version :)
[13:30:33] \o/
[13:30:35] kostajh: your patch is at mwdebug1001, can you check?
[13:30:50] feel free to test at mwmaint1002 instead (by scap pull'ing there)
[13:31:07] you can also run maintenance scripts at mwdebug1001 directly
[13:31:28] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply
[13:31:55] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply
[13:32:49] urbanecm: ok, I'll try mwdebug
[13:32:57] ok
[13:33:45] poor bot :/
[13:34:09] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply
[13:34:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P35505 and previous config saved to /var/cache/conftool/dbconfig/20221017-133417-ladsgroup.json
[13:34:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P35506 and previous config saved to /var/cache/conftool/dbconfig/20221017-133424-ladsgroup.json
[13:34:40] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply
[13:35:09] !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=eventgate-logging-external,name=codfw
[13:35:17] !log Repooling eventgate-logging-external in codfw - T303543
[13:37:16] urbanecm: lgtm
[13:37:21] great, syncing
[13:38:49] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic: deployment-cache instances are missing several major features available in production - https://phabricator.wikimedia.org/T320930 (10Vgutierrez) deployment-cache-upload07 is up & running as well: ` vgutierrez@deployment-cache-upload07:~$ curl -I --connect-to u...
[13:39:06] !log cgoubert@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=eventgate-logging-external,name=eqiad
[13:40:56] !log Depooling eventgate-logging-external in codfw - T303543
[13:41:08] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:843487|GrowthExperiments: Enable link recommendation for 5th round wikis (T304549)]] (duration: 11m 04s)
[13:41:36] (03CR) 10Vgutierrez: [C: 03+2] api: Offer JSON for metadata if requested [software/acme-chief] - 10https://gerrit.wikimedia.org/r/806940 (owner: 10Majavah)
[13:44:10] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply
[13:44:47] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply
[13:45:19] !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=eventgate-logging-external,name=eqiad
[13:45:56] kostajh: and should be live
[13:45:59] anything else, anyone?
[13:46:39] thanks urbanecm !
[13:46:54] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply
[13:47:14] !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[13:47:19] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply
[13:48:09] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson Netbox updated with new host kafka-jumbo1010: E1 U17 Port 17 CableID 20220240...
[13:48:20] !log Repooled eventgate-logging-external in equiad - T303543
[13:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:25] T303543: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543
[13:48:42] s/equiad/eqiad claime ^^
[13:48:50] French :P
[13:49:01] u always after q so I typo that a lot
[13:49:05] /o\
[13:49:11] Should I fix it directly in SAL ?
[13:49:13] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:49:14] yeah.. same in Spanish
[13:49:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P35507 and previous config saved to /var/cache/conftool/dbconfig/20221017-134924-ladsgroup.json
[13:49:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T318950)', diff saved to https://phabricator.wikimedia.org/P35508 and previous config saved to /var/cache/conftool/dbconfig/20221017-134931-ladsgroup.json
[13:49:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2168.codfw.wmnet with reason: Maintenance
[13:49:36] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950
[13:49:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2168.codfw.wmnet with reason: Maintenance
[13:49:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T318950)', diff saved to https://phabricator.wikimedia.org/P35509 and previous config saved to /var/cache/conftool/dbconfig/20221017-134953-ladsgroup.json
[13:50:11] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damilare Adedoyin - https://phabricator.wikimedia.org/T319057 (10herron)
[13:50:15] !log Depooling eventgate-analytics in codfw - T303543
[13:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:42] !log cgoubert@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=eventgate-analytics,name=codfw
[13:52:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T318950)', diff saved to https://phabricator.wikimedia.org/P35510 and previous config saved to /var/cache/conftool/dbconfig/20221017-135211-ladsgroup.json
[13:56:01] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply
[13:56:38] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply
[13:56:55] !log Repooling eventgate-analytics in codfw - T303543
[13:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:00] T303543: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543
[13:57:00] !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=eventgate-analytics,name=codfw
[13:59:05] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:00:08] !log Depooling eventgate-analytics in eqiad - T303543
[14:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:12] 10SRE, 10SRE-swift-storage: Get swift (and its components) ready for v6 - https://phabricator.wikimedia.org/T317909 (10MatthewVernon) FWIW, the rings currently only have v4 addresses in.
[14:00:17] !log cgoubert@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=eventgate-analytics,name=eqiad
[14:01:13] (03PS1) 10Vgutierrez: hieradata::deployment-prep: Bump deployment-cache-text|upload instances [puppet] - 10https://gerrit.wikimedia.org/r/843500 (https://phabricator.wikimedia.org/T320930)
[14:02:38] 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Decommission eqiad cage WiFi - https://phabricator.wikimedia.org/T320962 (10ayounsi) p:05Triage→03Medium
[14:02:48] (03CR) 10Vgutierrez: [C: 03+2] hieradata::deployment-prep: Bump deployment-cache-text|upload instances [puppet] - 10https://gerrit.wikimedia.org/r/843500 (https://phabricator.wikimedia.org/T320930) (owner: 10Vgutierrez)
[14:04:24] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply
[14:04:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T318955)', diff saved to https://phabricator.wikimedia.org/P35511 and previous config saved to /var/cache/conftool/dbconfig/20221017-140430-ladsgroup.json
[14:04:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2109.codfw.wmnet with reason: Maintenance
[14:04:35] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[14:04:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2109.codfw.wmnet with reason: Maintenance
[14:04:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T318955)', diff saved to https://phabricator.wikimedia.org/P35512 and previous config saved to /var/cache/conftool/dbconfig/20221017-140452-ladsgroup.json
[14:04:55] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply
[14:05:07] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (10herron) Hello, looks like we still have a few steps to check off before proceeding with granting access: @AnnWF could you please review and sign the L3 Acknowledgement o...
[14:05:13] !log Repooling eventgate-analytics in eqiad - T303543
[14:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:17] T303543: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543
[14:05:21] !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=eventgate-analytics,name=eqiad
[14:05:27] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (10herron)
[14:07:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P35513 and previous config saved to /var/cache/conftool/dbconfig/20221017-140717-ladsgroup.json
[14:09:12] !log Depooling eventgate-analytics-external in codfw - T303543
[14:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:20] !log cgoubert@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=eventgate-analytics-external,name=codfw
[14:11:27] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:12:03] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:13:11] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply
[14:13:33] 10SRE, 10Wikimedia-Mailing-lists: Allow list admins to train spam filters - https://phabricator.wikimedia.org/T244241 (10jijiki)
[14:13:43] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:13:45] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply
[14:14:16] !log Repooling eventgate-analytics-external in codfw - T303543
[14:14:19] !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=eventgate-analytics-external,name=codfw
[14:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:20] T303543: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543
[14:15:51] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:15:51] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.585 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:16:27] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48827 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:16:45] !log Depooling eventgate-analytics-external in eqiad - T303543
[14:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:52] !log cgoubert@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=eventgate-analytics-external,name=eqiad
[14:17:26] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (10hnowlan)
[14:19:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T318955)', diff saved to https://phabricator.wikimedia.org/P35514 and previous config saved to /var/cache/conftool/dbconfig/20221017-141921-ladsgroup.json
[14:19:26] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[14:20:18] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply
[14:20:37] 10SRE, 10serviceops, 10User-WDoran, 10User-brennen: Canaries canaries canaries - https://phabricator.wikimedia.org/T210143 (10jijiki)
[14:20:41] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply
[14:20:49] RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:20:53] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply
[14:20:58] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply
[14:21:57] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply
[14:21:58] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply
[14:22:06] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply
[14:22:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P35515 and previous config saved to /var/cache/conftool/dbconfig/20221017-142224-ladsgroup.json
[14:22:39] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply
[14:23:01] !log Repooling eventgate-analytics-external in eqiad - T303543
[14:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:06] T303543: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543
[14:23:10] !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=eventgate-analytics-external,name=eqiad
[14:24:41] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply
[14:25:04] !log cgoubert@deploy1002 helmfile [staging] DONE
helmfile.d/services/eventgate-analytics-external: apply [14:26:31] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for MHorsey - https://phabricator.wikimedia.org/T318729 (10herron) 05In progress→03Resolved a:03herron Resolving as this looks to have been completed. Please reopen if any followup is needed. Thanks! [14:27:43] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for "Stef Dunlap" - https://phabricator.wikimedia.org/T318626 (10herron) 05In progress→03Resolved a:03herron Transitioning to resolved as this looks to have been completed. Please reopen if any followup is needed. Thanks! [14:27:56] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [14:28:40] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [14:29:39] !log Depooling eventgate-main in codfw - T303543 [14:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:44] T303543: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 [14:29:49] !log cgoubert@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=eventgate-main,name=codfw [14:33:52] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [14:34:20] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [14:34:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P35516 and previous config saved to /var/cache/conftool/dbconfig/20221017-143427-ladsgroup.json [14:34:58] !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=eventgate-main,name=codfw [14:35:00] !log Repooling eventgate-main in codfw - T303543 [14:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:05] T303543: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 [14:36:35] 
!log Depooling eventgate-main in eqiad - T303543 [14:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:45] !log cgoubert@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=eventgate-main,name=eqiad [14:37:16] !log ongoing maintenance on cr1-eqiad [14:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T318950)', diff saved to https://phabricator.wikimedia.org/P35517 and previous config saved to /var/cache/conftool/dbconfig/20221017-143731-ladsgroup.json [14:37:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance [14:37:36] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [14:37:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance [14:37:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T318950)', diff saved to https://phabricator.wikimedia.org/P35518 and previous config saved to /var/cache/conftool/dbconfig/20221017-143753-ladsgroup.json [14:39:59] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [14:40:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T318950)', diff saved to https://phabricator.wikimedia.org/P35519 and previous config saved to /var/cache/conftool/dbconfig/20221017-144011-ladsgroup.json [14:40:23] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [14:40:43] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [14:41:11] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [14:41:28] !log Repooling eventgate-main in eqiad - T303543 
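[editor's note] The eventgate redeploys logged above all follow the same cycle: depool the service's DNS discovery record via conftool, apply the helmfile release, then repool. A minimal dry-run sketch of that cycle follows; the wrapper function and the exact confctl/helmfile argument forms are illustrative assumptions, not the production tooling, and the script only prints the commands it would run.

```shell
# Dry-run sketch of the depool -> deploy -> repool cycle seen in the log.
# NOTE: confctl/helmfile invocations below are assumed/illustrative; this
# script only echoes the commands, it does not touch any service.
redeploy_service() {
  svc="$1"   # e.g. eventgate-main (hypothetical example values)
  dc="$2"    # e.g. eqiad
  # 1. Depool the discovery record so traffic fails over to the other DC.
  echo "confctl select \"dnsdisc=${svc},name=${dc}\" set/pooled=false"
  # 2. Apply the Helm release for this environment.
  echo "helmfile -e ${dc} -f helmfile.d/services/${svc}/helmfile.yaml apply"
  # 3. Repool once the new pods are serving.
  echo "confctl select \"dnsdisc=${svc},name=${dc}\" set/pooled=true"
}

redeploy_service eventgate-main eqiad
```

In the log this is done per datacenter (codfw first, then eqiad), so one DC always stays pooled while the other is redeployed.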
[14:41:30] !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=eventgate-main,name=eqiad [14:45:39] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (10ssingh) [14:46:46] 10SRE, 10Data Engineering Planning, 10serviceops, 10Event-Platform Value Stream (Sprint 03), 10Patch-For-Review: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (10Clement_Goubert) 05Open→03Resolved All eventgate services redeployed, including staging environme... [14:46:54] 10SRE, 10serviceops, 10Patch-For-Review, 10good first task: Upgrade all deployment charts to use the latest version of common_templates - https://phabricator.wikimedia.org/T292390 (10Clement_Goubert) [14:49:23] RECOVERY - SSH on wdqs2005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:49:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P35520 and previous config saved to /var/cache/conftool/dbconfig/20221017-144934-ladsgroup.json [14:55:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P35521 and previous config saved to /var/cache/conftool/dbconfig/20221017-145517-ladsgroup.json [14:56:41] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) [15:02:33] kind of expected a bot topic edit 2 min ago [15:04:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T318955)', diff saved to https://phabricator.wikimedia.org/P35522 and previous config saved to /var/cache/conftool/dbconfig/20221017-150440-ladsgroup.json [15:04:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on 
db2139.codfw.wmnet with reason: Maintenance [15:04:46] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [15:04:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2139.codfw.wmnet with reason: Maintenance [15:05:41] mutante: the bots are having issues right now [15:06:10] Sariboo: ACK, thanks! [15:06:39] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (10aborrero) 05Stalled→03In progress p:05Medium→03High a:03... [15:09:06] 10SRE, 10Data Engineering Planning, 10serviceops, 10Event-Platform Value Stream (Sprint 03), 10Patch-For-Review: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (10Ottomata) Yeehaw thank you so much Clem! [15:10:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P35523 and previous config saved to /var/cache/conftool/dbconfig/20221017-151024-ladsgroup.json [15:10:56] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) [15:12:47] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 243, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:17:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2149.codfw.wmnet with reason: Maintenance [15:18:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2149.codfw.wmnet with reason: Maintenance [15:18:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 
(T318955)', diff saved to https://phabricator.wikimedia.org/P35524 and previous config saved to /var/cache/conftool/dbconfig/20221017-151808-ladsgroup.json [15:18:13] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [15:21:52] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 (10MPhamWMF) [15:21:56] (03PS1) 10DLynch: Fix editattempt_block country_code not being string [extensions/WikimediaEvents] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/843465 (https://phabricator.wikimedia.org/T320938) [15:22:08] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 (10MPhamWMF) a:03RKemper [15:23:55] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:23:55] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:24:47] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:24:51] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:24:57] PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:25:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T318950)', diff saved to https://phabricator.wikimedia.org/P35525 and previous config saved to /var/cache/conftool/dbconfig/20221017-152531-ladsgroup.json [15:25:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2182.codfw.wmnet with reason: Maintenance [15:25:37] T318950: Fix renamed 
indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [15:25:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2182.codfw.wmnet with reason: Maintenance [15:25:49] PROBLEM - BFD status on cr2-drmrs is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:25:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T318950)', diff saved to https://phabricator.wikimedia.org/P35526 and previous config saved to /var/cache/conftool/dbconfig/20221017-152552-ladsgroup.json [15:28:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T318950)', diff saved to https://phabricator.wikimedia.org/P35527 and previous config saved to /var/cache/conftool/dbconfig/20221017-152810-ladsgroup.json [15:30:05] jan_drewniak: It is that lovely time of the day again! You are hereby commanded to deploy Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221017T1530). 
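[editor's note] The ladsgroup entries above repeat a standard database maintenance cycle: downtime the replica with the sre.hosts.downtime cookbook, depool it with dbctl, run the schema change, then repool in stages. The sketch below is a dry-run illustration of that sequence; the cookbook and dbctl command forms are assumptions for readability, not the exact production syntax, and nothing is executed.

```shell
# Dry-run sketch of the replica maintenance cycle repeated in this log.
# NOTE: cookbook/dbctl invocations are illustrative assumptions; this
# function only prints the steps an operator would run.
db_maintenance() {
  host="$1"  # e.g. db2149.codfw.wmnet
  task="$2"  # e.g. T318955
  # 1. Silence alerts for the maintenance window.
  echo "cookbook sre.hosts.downtime --hours 8 -r Maintenance ${host}"
  # 2. Remove the replica from the pool before touching it.
  echo "dbctl commit 'Depooling ${host} (${task})'"
  # 3. (schema change runs here on ${host})
  # 4. Repool; in the log this happens in several commits as traffic
  #    weight is ramped back up gradually.
  echo "dbctl commit 'Repooling after maintenance ${host} (${task})'"
}

db_maintenance db2149.codfw.wmnet T318955
```

The staged repooling explains why each host appears in multiple "Repooling after maintenance" commits minutes apart.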
[15:30:51] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) [15:32:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T318955)', diff saved to https://phabricator.wikimedia.org/P35528 and previous config saved to /var/cache/conftool/dbconfig/20221017-153246-ladsgroup.json [15:32:52] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [15:34:07] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6939/IPv6: Idle - HE, AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:36:28] (03PS1) 10Ayounsi: cr1-eqiad: rename GTT interface [homer/public] - 10https://gerrit.wikimedia.org/r/843513 (https://phabricator.wikimedia.org/T304712) [15:37:52] (03CR) 10Ayounsi: [C: 03+2] cr1-eqiad: rename GTT interface [homer/public] - 10https://gerrit.wikimedia.org/r/843513 (https://phabricator.wikimedia.org/T304712) (owner: 10Ayounsi) [15:38:26] (03Merged) 10jenkins-bot: cr1-eqiad: rename GTT interface [homer/public] - 10https://gerrit.wikimedia.org/r/843513 (https://phabricator.wikimedia.org/T304712) (owner: 10Ayounsi) [15:38:27] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 4/7 UP : OSPFv3: 4/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:43:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P35529 and previous config saved to /var/cache/conftool/dbconfig/20221017-154317-ladsgroup.json [15:47:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P35530 and previous config saved to /var/cache/conftool/dbconfig/20221017-154753-ladsgroup.json [15:52:43] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is 
CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [15:52:56] (03PS1) 10Jdlrobson: Remove logo setting in YAML files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843514 [15:53:21] (03CR) 10Jdlrobson: Automate icon generation (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/838945 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [15:53:33] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [15:53:37] 10SRE, 10Thumbor, 10serviceops, 10Service-deployment-requests: New Service Request Wikimedia-Thumbor - https://phabricator.wikimedia.org/T304436 (10hnowlan) 05Open→03Invalid [15:53:53] 10SRE, 10Thumbor, 10serviceops, 10Service-deployment-requests: New Service Request Wikimedia-Thumbor - https://phabricator.wikimedia.org/T304436 (10hnowlan) Closing in favour of T233196 for main tracking [15:58:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P35531 and previous config saved to /var/cache/conftool/dbconfig/20221017-155823-ladsgroup.json [16:03:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P35532 and previous config saved to /var/cache/conftool/dbconfig/20221017-160259-ladsgroup.json [16:07:49] 10SRE, 10Observability-Logging, 10serviceops: rsyslogd: omkafka: action will suspended due to kafka error -187: Local: All broker connections are down - https://phabricator.wikimedia.org/T240560 (10jijiki) 05Open→03Resolved a:03jijiki I am closing this as it appears that it is not an issue any more, wi... 
[16:13:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T318950)', diff saved to https://phabricator.wikimedia.org/P35533 and previous config saved to /var/cache/conftool/dbconfig/20221017-161330-ladsgroup.json [16:13:36] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [16:16:54] 10SRE, 10Observability-Logging, 10Wikimedia-Logstash, 10observability, 10serviceops-collab: ensure httpd error logs from "misc apps" (krypton) end up in logstash - https://phabricator.wikimedia.org/T216090 (10jijiki) [16:17:18] 10SRE, 10Traffic, 10Patch-For-Review: per-backend-service concurrency limits in ATS-BE - https://phabricator.wikimedia.org/T306223 (10Ladsgroup) 05Stalled→03Open >>! In T306223#8111941, @CDanis wrote: > Awaiting {T309651} to continue testing boldly unstalling it. [16:18:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T318955)', diff saved to https://phabricator.wikimedia.org/P35534 and previous config saved to /var/cache/conftool/dbconfig/20221017-161806-ladsgroup.json [16:18:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2156.codfw.wmnet with reason: Maintenance [16:18:11] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [16:18:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2156.codfw.wmnet with reason: Maintenance [16:18:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2094.codfw.wmnet with reason: Maintenance [16:18:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2094.codfw.wmnet with reason: Maintenance [16:18:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T318955)', diff saved to https://phabricator.wikimedia.org/P35535 and previous config saved to 
/var/cache/conftool/dbconfig/20221017-161843-ladsgroup.json [16:19:27] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [16:19:47] 10SRE, 10SecTeam-Processed, 10Security: Deprecate use of ssh-rsa keys? - https://phabricator.wikimedia.org/T311368 (10sbassett) [16:19:51] 10SRE, 10MW-1.38-notes (1.38.0-wmf.19; 2022-01-24): Delay spinner showing for graphs for 1s - https://phabricator.wikimedia.org/T256641 (10jijiki) 05Open→03Resolved a:03jijiki [16:19:56] 10SRE, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10MW-1.38-notes (1.38.0-wmf.19; 2022-01-24), and 2 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10jijiki) [16:26:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T318955)', diff saved to https://phabricator.wikimedia.org/P35536 and previous config saved to /var/cache/conftool/dbconfig/20221017-162636-ladsgroup.json [16:26:41] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [16:33:19] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [16:34:05] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [16:34:12] (03CR) 10BryanDavis: bullseye: add bzip2 and zstd compression programs (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/842992 (https://phabricator.wikimedia.org/T294607) (owner: 10BryanDavis) [16:35:00] (03PS1) 10Ayounsi: Revert "Move all eqiad VRRP mastership to cr2" [homer/public] - 10https://gerrit.wikimedia.org/r/843546 [16:41:43] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P35537 and previous config saved to /var/cache/conftool/dbconfig/20221017-164143-ladsgroup.json [16:43:43] 10SRE, 10Data Engineering Planning, 10Data-Engineering-Operations, 10Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (10xcollazo) 05Open→03Resolved We discussed this and figured that @BTullis, and a new SRE that is joining us soon, shoul... [16:49:35] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:49:35] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:50:19] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:50:35] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:50:51] RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:51:39] RECOVERY - BFD status on cr2-drmrs is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:51:49] 10SRE, 10Data Engineering Planning, 10Data-Engineering-Operations, 10Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (10Dzahn) 05Open→03Resolved [16:51:54] 10SRE, 10Data Engineering Planning, 10Data-Engineering-Operations, 10Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (10Dzahn) 05Resolved→03Open Keeping the existing setup was the one possible outcome I had tried to prevent here :( [16:52:35] RECOVERY - BFD 
status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:52:48] 10SRE, 10Data Engineering Planning, 10Data-Engineering-Operations, 10Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (10Dzahn) @BTullis Is there any way we could get this out of the exim aliases? ...pleaaasse.... [16:54:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [16:55:14] (03CR) 10Ayounsi: [C: 03+2] Revert "Move all eqiad VRRP mastership to cr2" [homer/public] - 10https://gerrit.wikimedia.org/r/843546 (owner: 10Ayounsi) [16:55:50] (03Merged) 10jenkins-bot: Revert "Move all eqiad VRRP mastership to cr2" [homer/public] - 10https://gerrit.wikimedia.org/r/843546 (owner: 10Ayounsi) [16:56:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P35538 and previous config saved to /var/cache/conftool/dbconfig/20221017-165649-ladsgroup.json [16:57:09] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:59:25] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:00:04] ryankemper: Dear deployers, time to do the Wikidata Query Service weekly deploy deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221017T1700). 
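[editor's note] The OSPF checks above report neighbor counts in the form "OSPFv2: 6/7 UP" and go CRITICAL whenever fewer neighbors are up than expected. A minimal sketch of that threshold logic, under the assumption that the check simply compares up vs. total (the real Icinga plugin may differ):

```shell
# Hypothetical sketch of the pass/fail logic behind "OSPF status" checks:
# CRITICAL whenever up < total, OK otherwise. Input is "up/total".
ospf_state() {
  up="${1%/*}"     # part before the slash
  total="${1#*/}"  # part after the slash
  if [ "$up" -lt "$total" ]; then
    echo "CRITICAL"
  else
    echo "OK"
  fi
}

ospf_state 6/7
ospf_state 7/7
```

This matches the log's behavior: "OSPFv2: 6/7 UP" fires a PROBLEM, and the alert recovers once the count returns to 7/7.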
[17:00:18] (03PS1) 10Ayounsi: Revert "Drain eqiad-drmrs GTT link" [homer/public] - 10https://gerrit.wikimedia.org/r/843547 [17:02:41] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:03:57] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:04:45] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:05:24] (03PS2) 10Ayounsi: Revert "Drain eqiad-drmrs GTT link" [homer/public] - 10https://gerrit.wikimedia.org/r/843547 [17:06:13] that's expected, I restored VRRP to its normal state ^ [17:06:13] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:06:25] (03CR) 10Ayounsi: [C: 03+2] Revert "Drain eqiad-drmrs GTT link" [homer/public] - 10https://gerrit.wikimedia.org/r/843547 (owner: 10Ayounsi) [17:06:27] PROBLEM - SSH on mw1319.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:07:00] (03Merged) 10jenkins-bot: Revert "Drain eqiad-drmrs GTT link" [homer/public] - 10https://gerrit.wikimedia.org/r/843547 (owner: 10Ayounsi) [17:11:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T318955)', diff saved to https://phabricator.wikimedia.org/P35539 and previous config saved to /var/cache/conftool/dbconfig/20221017-171156-ladsgroup.json [17:11:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2177.codfw.wmnet with reason: Maintenance [17:12:02] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [17:12:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 8:00:00 on db2177.codfw.wmnet with reason: Maintenance [17:12:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T318955)', diff saved to https://phabricator.wikimedia.org/P35540 and previous config saved to /var/cache/conftool/dbconfig/20221017-171229-ladsgroup.json [17:16:05] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 32787 [17:19:00] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 32787 [17:26:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T318955)', diff saved to https://phabricator.wikimedia.org/P35541 and previous config saved to /var/cache/conftool/dbconfig/20221017-172658-ladsgroup.json [17:27:03] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [17:27:08] (03PS1) 10Ayounsi: Management: remove access/wifi exceptions [homer/public] - 10https://gerrit.wikimedia.org/r/843519 (https://phabricator.wikimedia.org/T320962) [17:42:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P35542 and previous config saved to /var/cache/conftool/dbconfig/20221017-174204-ladsgroup.json [17:49:15] (03PS1) 10Ebernhardson: cirrus: Correct comments in ProductionServices.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843521 (https://phabricator.wikimedia.org/T262630) [17:53:05] PROBLEM - BFD status on cr2-drmrs is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:53:53] 10SRE, 10Discovery-Search, 10serviceops, 10serviceops-collab, and 2 others: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10mpopov) > Just for clarification, we are talking about the service named `apple-search` in service discovery and not `search` or 
`search-https`, as... [17:55:57] 10SRE, 10SRE-OnFire, 10Data-Persistence, 10Wikimedia-Incident: s6 master failure - https://phabricator.wikimedia.org/T320990 (10RLazarus) p:05Triage→03Medium [17:56:16] 10SRE, 10SRE-OnFire, 10Data-Persistence, 10Wikimedia-Incident: s6 master failure - https://phabricator.wikimedia.org/T320990 (10RLazarus) [17:57:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P35543 and previous config saved to /var/cache/conftool/dbconfig/20221017-175711-ladsgroup.json [17:59:51] RECOVERY - BFD status on cr2-drmrs is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:01:07] (03PS1) 10Dzahn: remove git-ssh from common/service.yaml [puppet] - 10https://gerrit.wikimedia.org/r/843522 (https://phabricator.wikimedia.org/T296022) [18:07:31] RECOVERY - SSH on mw1319.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:07:35] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 243, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:09:47] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:12:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T318955)', diff saved to https://phabricator.wikimedia.org/P35544 and previous config saved to /var/cache/conftool/dbconfig/20221017-181217-ladsgroup.json [18:12:23] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [18:17:25] 10SRE, 10Data Engineering Planning, 10Data-Engineering-Operations, 10Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - 
https://phabricator.wikimedia.org/T315486 (10BTullis) Ok @dzahn - I'm sorry, I didn't realise that moving this out of the Exim aliases was important to you. I though... [18:17:50] 10SRE, 10ops-eqiad, 10Data-Persistence, 10Sustainability (Incident Followup): Check DIMM A6 on db1131 - https://phabricator.wikimedia.org/T320994 (10RLazarus) [18:18:55] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:21:39] 10SRE, 10Data Engineering Planning, 10Data-Engineering-Operations, 10Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (10Ladsgroup) [[https://phabricator.wikimedia.org/T315486#8172401|Again]], you can have it in mailman as well, relend alerts... [18:22:04] (03PS1) 10Volans: custom_script_proxy: increase polling sleep [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/843525 [18:24:14] (03CR) 10Ayounsi: [C: 03+1] custom_script_proxy: increase polling sleep [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/843525 (owner: 10Volans) [18:28:08] (03CR) 10Volans: [C: 03+2] custom_script_proxy: increase polling sleep [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/843525 (owner: 10Volans) [18:28:52] (03Merged) 10jenkins-bot: custom_script_proxy: increase polling sleep [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/843525 (owner: 10Volans) [18:32:26] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10Volans) This is the data for the failed disk `lang=bash $ sudo perccli64 /c0/eall/sall show CLI Version = 007.1910.0000.0000 Oct 08, 2021 Operating system = Linux 5.10.0-18-amd64 Controlle... 
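The db2177 depool/repool entries above follow a standard maintenance pattern: downtime the replica, depool it with dbctl, do the work, then repool. A dry-run sketch (commands are echoed only, since `cookbook` and `dbctl` live on the cluster's cumin hosts; the gradual percentage steps are an assumption inferred from the repeated "Repooling after maintenance" commits):

```shell
#!/usr/bin/env bash
# Dry-run sketch of the db2177 maintenance flow. The `run` wrapper only
# echoes each command instead of executing it.
run() { printf '+ %s\n' "$*"; }

host="db2177"
task="T318955"

# 1. Set an 8-hour alert downtime via the spicerack cookbook
run sudo cookbook sre.hosts.downtime --hours 8 -r "Maintenance" "${host}.codfw.wmnet"

# 2. Depool the replica from MediaWiki's live database config
run sudo dbctl instance "$host" depool
run sudo dbctl config commit -m "Depooling ${host} (${task})"

# 3. ...apply the schema change (here: dropping flaggedrevs columns)...

# 4. Repool gradually so the cold replica warms up under partial traffic
for pct in 25 50 75 100; do
  run sudo dbctl instance "$host" pool -p "$pct"
  run sudo dbctl config commit -m "Repooling after maintenance ${host} (${task})"
done
```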
[18:37:07] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:37:20] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1002/37584/conf1007.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/843522 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [18:38:32] 10SRE-tools, 10Icinga, 10Infrastructure-Foundations: get-raid-status-perccli should allow for commands to return non-zero exit code - https://phabricator.wikimedia.org/T320998 (10Volans) p:05Triage→03Medium [18:39:18] (03PS2) 10Dzahn: remove git-ssh from common/service.yaml and conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/843522 (https://phabricator.wikimedia.org/T296022) [18:39:21] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10Volans) For the failure of the script I've opened T320998 [18:39:23] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:48:37] !log dzahn@cumin2002 conftool action : set/pooled=inactive; selector: name=phab2001-vcs.codfw.wmnet [18:49:25] !log dzahn@cumin2002 conftool action : set/pooled=inactive; selector: name=phab1001-vcs.eqiad.wmnet [18:53:38] (03PS1) 10Dzahn: conftool-data: remove phabricator / git-ssh [puppet] - 10https://gerrit.wikimedia.org/r/843567 (https://phabricator.wikimedia.org/T296022) [18:53:47] PROBLEM - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/git-ssh is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [18:53:49] PROBLEM - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/git-ssh is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [18:53:55] PROBLEM - Confd 
template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/git-ssh is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [18:54:21] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.153.250:22, 2620:0:860:ed1a::3:fa:22]) https://wikitech.wikimedia.org/wiki/PyBal [18:54:27] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/git-ssh is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [18:54:34] (03CR) 10Dzahn: [C: 03+2] "https://wikitech.wikimedia.org/wiki/Conftool#Decommission_a_server" [puppet] - 10https://gerrit.wikimedia.org/r/843567 (https://phabricator.wikimedia.org/T296022) (owner: 10Dzahn) [18:55:06] (ConfdResourceFailed) firing: (2) confd resource _srv_config-master_pybal_codfw_git-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [18:56:05] RECOVERY - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [18:56:14] !log puppetmaster1001 - deleted confd-template .err files [18:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:07] !log puppetmaster2001 - deleted confd-template .err files [18:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:45] PROBLEM - PyBal IPVS diff check on lvs2008 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.153.250:22, 2620:0:860:ed1a::3:fa:22]) https://wikitech.wikimedia.org/wiki/PyBal [18:58:33] PROBLEM - PyBal IPVS diff check on lvs1018 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: 
set([2620:0:861:ed1a::3:16:22, 208.80.154.250:22]) https://wikitech.wikimedia.org/wiki/PyBal [18:59:19] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:861:ed1a::3:16:22, 208.80.154.250:22]) https://wikitech.wikimedia.org/wiki/PyBal [18:59:55] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1018 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:861:ed1a::3:16:22, 208.80.154.250:22]) daniel_zahn removing a service https://wikitech.wikimedia.org/wiki/PyBal [18:59:55] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:861:ed1a::3:16:22, 208.80.154.250:22]) daniel_zahn removing a service https://wikitech.wikimedia.org/wiki/PyBal [18:59:55] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2008 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.153.250:22, 2620:0:860:ed1a::3:fa:22]) daniel_zahn removing a service https://wikitech.wikimedia.org/wiki/PyBal [18:59:55] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.153.250:22, 2620:0:860:ed1a::3:fa:22]) daniel_zahn removing a service https://wikitech.wikimedia.org/wiki/PyBal [19:00:06] (ConfdResourceFailed) resolved: (4) confd resource _srv_config-master_pybal_codfw_git-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:02:55] PROBLEM - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/git-ssh is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [19:02:55] (03PS3) 10Dzahn: remove git-ssh from common/service.yaml [puppet] - 10https://gerrit.wikimedia.org/r/843522 (https://phabricator.wikimedia.org/T296022) 
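The `conftool action` entries logged at 18:48 (and the repool at 19:57) correspond to confctl invocations that flip the pooled state in etcd, which PyBal then consumes. A dry-run sketch (echoed only; `confctl` exists only on cluster management hosts):

```shell
#!/usr/bin/env bash
# Dry-run sketch of the conftool state changes around the git-ssh
# decommission attempt. The wrapper only echoes the commands.
run() { printf '+ %s\n' "$*"; }

# Take the VCS backends out of rotation before removing the service
run sudo confctl select name=phab2001-vcs.codfw.wmnet set/pooled=inactive
run sudo confctl select name=phab1001-vcs.eqiad.wmnet set/pooled=inactive

# ...and the later rollback, repooling both backends
run sudo confctl select "dc=codfw,name=phab2001-vcs.codfw.wmnet" set/pooled=yes
run sudo confctl select "dc=eqiad,name=phab1001-vcs.eqiad.wmnet" set/pooled=yes
```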
[19:03:05] (03PS4) 10Dzahn: remove git-ssh from common/service.yaml [puppet] - 10https://gerrit.wikimedia.org/r/843522 (https://phabricator.wikimedia.org/T296022) [19:03:05] (ConfdResourceFailed) firing: (2) confd resource _srv_config-master_pybal_codfw_git-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:05:13] RECOVERY - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [19:05:17] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [19:08:05] (ConfdResourceFailed) firing: (4) confd resource _srv_config-master_pybal_codfw_git-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [19:12:03] PROBLEM - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/git-ssh is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [19:12:09] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/git-ssh is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [19:13:23] sigh.. I keep fixing it but it is coming back. 
removing properly is next to impossible for me [19:13:58] next is the change to service.yaml then [19:17:45] PROBLEM - clamd running on otrs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [19:18:21] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:20:01] RECOVERY - clamd running on otrs1001 is OK: PROCS OK: 1 process with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [19:20:20] !log otrs1001 - started failed clamav-daemon service [19:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:27] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:20:41] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:25:01] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:26:47] PROBLEM - clamd running on otrs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [19:27:05] PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:27:21] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:28:57] RECOVERY - clamd running on otrs1001 is OK: PROCS OK: 1 process with UID = 112 (clamav), command name clamd 
https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [19:30:12] 10SRE, 10ops-eqiad, 10DBA, 10Sustainability (Incident Followup): Check DIMM A6 on db1131 - https://phabricator.wikimedia.org/T320994 (10LSobanski) [19:32:26] (03PS2) 10Bartosz Dziewoński: Enable DiscussionTools mobile on enwiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837105 (https://phabricator.wikimedia.org/T317467) (owner: 10Esanders) [19:34:49] (03PS1) 10Bartosz Dziewoński: Use ParsoidOutputAccess when RESTBase is not set up (WMF private wikis) [extensions/DiscussionTools] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/843549 (https://phabricator.wikimedia.org/T315689) [19:34:59] (03PS1) 10Bartosz Dziewoński: Log page/revision IDs when the page/revision seems to be missing [extensions/DiscussionTools] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/843550 (https://phabricator.wikimedia.org/T315688) [19:35:27] PROBLEM - clamd running on otrs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [19:37:37] RECOVERY - clamd running on otrs1001 is OK: PROCS OK: 1 process with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [19:38:15] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:42:23] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:44:05] PROBLEM - clamd running on otrs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [19:44:41] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:53:27] PROBLEM - BGP status on 
cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:55:47] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:56:22] (03PS1) 10Dzahn: Revert "conftool-data: remove phabricator / git-ssh" [puppet] - 10https://gerrit.wikimedia.org/r/843551 [19:57:47] (03PS1) 10Zabe: apache: Drop ve.wikimedia.org rewrite [puppet] - 10https://gerrit.wikimedia.org/r/843569 (https://phabricator.wikimedia.org/T320890) [19:57:57] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=codfw,name=phab2001-vcs.codfw.wmnet [19:58:16] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=phab1001-vcs.eqiad.wmnet [19:59:44] (03CR) 10Dzahn: [C: 03+2] Revert "conftool-data: remove phabricator / git-ssh" [puppet] - 10https://gerrit.wikimedia.org/r/843551 (owner: 10Dzahn) [19:59:49] RECOVERY - clamd running on otrs1001 is OK: PROCS OK: 1 process with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [20:00:04] RoanKattouw, Urbanecm, cjming, and TheresNoTime: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221017T2000). nyaa~ [20:00:05] Kemayo and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:12] i can deploy today! [20:00:18] * urbanecm is wondering what's the nyaa~ thing [20:00:27] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:00:34] MatmaRex: Kemayo: hi! 
[20:00:45] hi [20:01:22] (03CR) 10Urbanecm: [C: 03+2] Use ParsoidOutputAccess when RESTBase is not set up (WMF private wikis) [extensions/DiscussionTools] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/843549 (https://phabricator.wikimedia.org/T315689) (owner: 10Bartosz Dziewoński) [20:01:24] (03CR) 10Urbanecm: [C: 03+2] Log page/revision IDs when the page/revision seems to be missing [extensions/DiscussionTools] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/843550 (https://phabricator.wikimedia.org/T315688) (owner: 10Bartosz Dziewoński) [20:01:53] (03CR) 10Urbanecm: [C: 03+2] Enable DiscussionTools mobile on enwiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837105 (https://phabricator.wikimedia.org/T317467) (owner: 10Esanders) [20:02:09] MatmaRex: the beta patch will land soon, i'll let you know when the backports can be tested [20:02:19] thanks [20:02:38] (03Merged) 10jenkins-bot: Enable DiscussionTools mobile on enwiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/837105 (https://phabricator.wikimedia.org/T317467) (owner: 10Esanders) [20:02:57] i can do https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/843465 too, not seeing Kemayo here ATM, but it also sounds fairly simple [20:03:59] i can also take repsonsibility for that if he's not around [20:04:14] i think testing for that is just checking that the error numbers are dropping in logs [20:05:38] yep, i think so too [20:05:41] (03CR) 10Urbanecm: [C: 03+2] Fix editattempt_block country_code not being string [extensions/WikimediaEvents] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/843465 (https://phabricator.wikimedia.org/T320938) (owner: 10DLynch) [20:05:45] +2'ed as well [20:06:35] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:06:37] PROBLEM - clamd running on otrs1001 is CRITICAL: PROCS CRITICAL: 0 processes 
with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [20:08:32] (03Merged) 10jenkins-bot: Use ParsoidOutputAccess when RESTBase is not set up (WMF private wikis) [extensions/DiscussionTools] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/843549 (https://phabricator.wikimedia.org/T315689) (owner: 10Bartosz Dziewoński) [20:08:35] (03Merged) 10jenkins-bot: Log page/revision IDs when the page/revision seems to be missing [extensions/DiscussionTools] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/843550 (https://phabricator.wikimedia.org/T315688) (owner: 10Bartosz Dziewoński) [20:08:37] (03Merged) 10jenkins-bot: Fix editattempt_block country_code not being string [extensions/WikimediaEvents] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/843465 (https://phabricator.wikimedia.org/T320938) (owner: 10DLynch) [20:09:33] MatmaRex: all three pulled to mwdebug1001, can you check? [20:09:53] yeah [20:10:10] i can check the first one, for the other two i think we have to watch the logs [20:11:27] sounds ok with me [20:11:45] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:11:49] my edit on https://office.wikimedia.org/wiki/OfficeWiki:T315689_test didn't cause any errors to be logged, so that looks good [20:12:17] okay, so, ok to sync i guess? 
[20:12:56] yes please [20:13:34] doing [20:13:36] !log urbanecm@deploy1002 Started scap: 6762292a4: e320d48c8: 6762292a4: DicsussionTools/WikimediaEvents backports (T315688, T315689, T320938) [20:13:43] T320938: mediawiki.editattempt_block: '.country_code' should be string - https://phabricator.wikimedia.org/T320938 [20:13:43] T315689: MWException: Error contacting the Parsoid/RESTBase server (HTTP 403): (no message) from DiscussionTools (on private wikis) – permalinks unavailable - https://phabricator.wikimedia.org/T315689 [20:13:44] T315688: MWException: Error contacting the Parsoid/RESTBase server (HTTP 404) from DiscussionTools (on open wikis) – permalinks unavailable for some edits - https://phabricator.wikimedia.org/T315688 [20:14:02] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Damilare Adedoyin - https://phabricator.wikimedia.org/T319057 (10XenoRyet) @Dzahn Hi there, I'm @Damilare's manager, and I approve. [20:15:04] urbanecm: Oops, sorry I missed the start of the window. @MatmaRex thanks for catching it for me. [20:15:11] hi Kemayo! [20:15:28] And yes, mine's 100% a watching-logs one. [20:16:26] (03PS1) 10AOkoth: vrts: rename daemon in install script [puppet] - 10https://gerrit.wikimedia.org/r/843572 (https://phabricator.wikimedia.org/T317059) [20:17:51] (03CR) 10Dzahn: [C: 03+1] vrts: rename daemon in install script [puppet] - 10https://gerrit.wikimedia.org/r/843572 (https://phabricator.wikimedia.org/T317059) (owner: 10AOkoth) [20:18:11] !log urbanecm@deploy1002 Finished scap: 6762292a4: e320d48c8: 6762292a4: DicsussionTools/WikimediaEvents backports (T315688, T315689, T320938) (duration: 04m 35s) [20:18:24] MatmaRex: all live! 
[20:18:36] thanks [20:18:44] (03CR) 10AOkoth: [C: 03+2] vrts: rename daemon in install script [puppet] - 10https://gerrit.wikimedia.org/r/843572 (https://phabricator.wikimedia.org/T317059) (owner: 10AOkoth) [20:19:27] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [20:21:02] (03PS1) 10Herron: slo_dashboards: move slo definitions and defaults to files [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/843574 (https://phabricator.wikimedia.org/T320749) [20:21:39] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [20:21:39] RECOVERY - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [20:27:41] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:28:19] RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:28:31] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/git-ssh is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [20:28:31] PROBLEM - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/git-ssh is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [20:30:59] PROBLEM - SSH on db1116.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook 
[20:32:09] (MXQueueHigh) firing: MX host mx1001:9100 has many queued messages: 7353 #page - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueHigh [20:32:26] looking! so much for getting to finish my last IR first :) [20:34:10] here. in meet. acked [20:34:23] it's mx. yea [20:35:06] yeah, I see 181 emails for info@wikipedia.org and 1483 for INFO@wikipedia.org [20:35:31] those are OTRS [20:35:46] nod [20:36:08] and we'd had alerts about clamav-daemon there [20:36:11] RECOVERY - clamd running on otrs1001 is OK: PROCS OK: 1 process with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [20:36:14] which we had restarted [20:36:18] as in .. ^ that [20:36:32] aha! [20:36:38] we should try to deliver those mails again [20:36:47] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:36:51] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:37:09] arnoldokoth: fyi! since we just talked about that. the page is related [20:37:21] sounds reasonable -- it sounds like you have context, can I let you do the honors? [20:37:22] mutante: clamd failed 20 minutes ago based on history in here.
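The per-recipient counts above (181 for info@, 1483 for INFO@) come from grouping the exim queue listing by recipient, the same pipeline quoted later in the log. Demonstrated here on a fabricated three-address `mailq` excerpt, since a real queue needs a running exim; in exim's listing each indented line is a recipient, and a leading "D" marks an address already delivered, which is why the pipeline filters with `grep -v D`:

```shell
#!/usr/bin/env bash
# The queue-counting pipeline from the log, run against sample data.
sample_mailq() {
  cat <<'EOF'
25m  2.5K 1iXnkA-0001sv-La <someone@example.org>
          INFO@wikipedia.org
30m  1.1K 1iXnkB-0001sw-Mb <other@example.net>
          INFO@wikipedia.org
45m  3.0K 1iXnkC-0001sx-Nc <third@example.com>
        D info@wikipedia.org
50m  0.9K 1iXnkD-0001sy-Od <fourth@example.com>
          info@wikipedia.org
EOF
}

# Count still-undelivered recipients per address, smallest count first
sample_mailq | grep wikipedia.org | grep -v D | sort | uniq -c | sort -n
```

Re-running the pipeline while a forced queue run is in progress (as done at 20:43) shows whether the counts are dropping.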
[20:40:04] quite high load on otrs1001 too, and yeah the 451 temp errors logged by exim on the mxes would line up with clamd or similar [20:40:12] !log mx1001 - exim4 -qf - trying to re-deliver mail in queue for info@ OTRS queue [20:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:48] (03PS1) 10AOkoth: vrts: fix download link [puppet] - 10https://gerrit.wikimedia.org/r/843579 (https://phabricator.wikimedia.org/T317059) [20:40:50] (03PS1) 10Bartosz Dziewoński: Add "Clear Affordances" to DiscussionTools beta feature on most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843580 (https://phabricator.wikimedia.org/T320683) [20:42:50] I don't know if it's doing anything.. command is stuck at command prompt so far [20:43:01] PROBLEM - clamd running on otrs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [20:43:03] just followed https://wikitech.wikimedia.org/wiki/Exim#force_delivery_attempt [20:43:38] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools mobile visual enhancements at jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843581 (https://phabricator.wikimedia.org/T318868) [20:43:38] I just tried rerunning `mailq | grep wikipedia.org | grep -v D | sort | uniq -c | sort -n` and the count for INFO@wikipedia.org is dropping, so looks like it's working [20:43:39] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:43:48] nor do I know why clamav keeps crashing [20:44:27] it gets killed [20:45:13] PROBLEM - BFD status on cr2-drmrs is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:45:21] yeah, not familiar with the service at all but I just saw the same in journalctl -- looking [20:46:32] it sounded you and arnoldokoth were 
just working on something related, is there any change we should consider rolling back? [20:46:54] there is nothing to roll back. Arnold was working in cloud [20:47:06] speculating but I think clamav is getting killed because of memory. [20:47:07] and all that was done in prod was to "systemctl start" the service a couple times [20:47:20] nod [20:47:21] I just started it again [20:47:29] RECOVERY - BFD status on cr2-drmrs is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:47:43] RECOVERY - clamd running on otrs1001 is OK: PROCS OK: 1 process with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [20:48:54] looking at exim4 mainlog on otrs1001 [20:50:14] mkdir[23205]: /bin/mkdir: cannot create directory ‘/run/clamav’: File exists [20:50:33] because it did not shut down in a normal way.. yea... [20:50:37] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:52:39] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:53:50] Oct 17 20:50:52 otrs1001 kernel: [15133154.645578] oom_reaper: reaped process 23207 (clamd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [20:53:52] mutante: looks like it got killed again. 
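The oom_reaper kernel line at 20:53 is what confirms "it gets killed" as an out-of-memory kill rather than a crash. On the host itself you would grep the kernel log (the commented commands); here the same filter is applied to the exact line captured in this log:

```shell
#!/usr/bin/env bash
# Confirming a daemon was OOM-killed. On a real host:
#   journalctl -k --since "1 hour ago" | grep -iE 'out of memory|oom_reaper'
#   dmesg -T | grep -iE 'oom|killed process'
oom_lines() { grep -iE 'oom_reaper|out of memory|killed process'; }

sample='Oct 17 20:50:52 otrs1001 kernel: [15133154.645578] oom_reaper: reaped process 23207 (clamd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB'

printf '%s\n' "$sample" | oom_lines
```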
[20:54:27] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [20:54:55] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:56:39] PROBLEM - clamd running on otrs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [20:56:44] so maybe it's one specific mail that is the bad one due to the attachment it has [20:57:19] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:58:05] CPU on that host looks like it started maxing out intermittently around 19:04, not quite two hours ago https://grafana.wikimedia.org/goto/J9ekfEI4z?orgId=1 [20:58:19] reached out to get more help [20:58:25] here to help as well, mutante pinged me [20:59:01] reducing the clamd maxthreads might help to get through the backlog without ooming [20:59:22] * mutante disables puppet on otrs1001 so we can potentially edit the config [20:59:26] can we give it more ram as a stopgap? [20:59:48] jhathaway: I was about to ask the same, it's only got four gigs [21:00:05] Reedy, sbassett, Maryum, and manfredi: I, the Bot under the Fountain, call upon thee, The Deployer, to do Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221017T2100). 
[21:00:12] it also has a "debug" setting that is currently not enabled [21:00:15] of course it could be some pathological issue, which always consumes all ram [21:01:12] I see the MaxThreads setting now. it's 12 [21:01:30] I can reduce that to .. 6 and try [21:02:22] security deploy is getting started [21:02:25] !log otrs1001 - temp disabled puppet, changing MaxThreads from 12 to 6 in /etc/clamav/clamd.conf [21:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:49] seems like a common issue though no idea what triggered it in our case. https://usercontent.irccloud-cdn.com/file/hbJdQfFt/Screenshot%202022-10-18%20at%2000.00.59.png [21:03:25] RECOVERY - clamd running on otrs1001 is OK: PROCS OK: 1 process with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [21:03:34] it has survived 50s so far [21:03:53] exim log looks a bit better [21:03:54] encouraging [21:04:11] and it's gone [21:04:40] nice! [21:05:09] or it's gone, like it got killed again? [21:05:24] yea, gone is the hope that it fixed it:p [21:05:29] damn [21:05:36] we can do just a single thread next [21:05:49] or should we try to bump ram on the vm? [21:06:21] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:06:43] yeah I wonder if ram will help? also wonder how large are these mails in the backlog [21:06:44] doesn't that require rebooting the VM [21:06:44] from glancing at the queue nothing immediately jumps out as an enormous attachment or anything [21:07:19] pretty much all like 15K or less [21:07:34] mutante: I assume so, but I have never bumped ram on a ganeti node [21:07:53] and the docs in wikitech seem sparse?
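The mitigation logged at 21:02 — disable puppet so it won't revert the hand edit, lower MaxThreads, restart clamav — can be sketched as follows. The edit is shown against a scratch copy of the config (the real file is /etc/clamav/clamd.conf), and the disable/restart commands are only echoed; `disable-puppet` is assumed to be the site wrapper, plain `puppet agent --disable` elsewhere:

```shell
#!/usr/bin/env bash
# Sketch of the manual clamd tuning from the log. Fewer worker threads
# means fewer mails scanned concurrently, cutting peak memory use.
run() { printf '+ %s\n' "$*"; }

conf="$(mktemp)"
printf 'LogSyslog true\nMaxThreads 12\nReadTimeout 180\n' > "$conf"

run sudo disable-puppet "temp lowering clamd MaxThreads"  # keep puppet from reverting the edit
sed -i 's/^MaxThreads .*/MaxThreads 6/' "$conf"           # halve the scanner threads
run sudo systemctl restart clamav-daemon

grep '^MaxThreads' "$conf"
```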
[21:08:04] https://wikitech.wikimedia.org/wiki/Ganeti#Resize_a_VM [21:08:09] I think an attachment can make clamav die regardless of total size [21:08:26] but no idea how to identify which mail kills it.. if it's always the same [21:09:06] !log otrs1001 - changing MaxThreads from 6 to 1 in /etc/clamav/clamd.conf, starting clamav [21:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:34] jinxer-wm: yeah gnt-instance modify should do the trick, "increase/decrease cpu/ram" [21:09:39] oh look, End queue run: pid=3611 [21:09:44] jhathaway: not jinxer-wm [21:09:48] exim4 log on otrs1001 stopped moving [21:09:55] until now [21:10:36] jhathaway: when we have "Retry time not yet reached" can we ask it to try anyways? [21:10:44] would that just be the -qf ? [21:10:56] I would think that would be -qf [21:11:11] tries it on otrs1001 [21:11:22] 2022-10-17 21:11:15 Start queue run: pid=4847 -qf [21:11:32] clam: Active: active (running) since Mon 2022-10-17 21:07:27 UTC; 3min 0s ago [21:11:52] some are being Completed [21:13:01] ok, so "exiqgrep -c" on otrs1001 is going down [21:13:05] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:13:12] but on mx1001 it's still going up [21:14:00] we are now under 100 mails in the local queue [21:15:06] 60 .... 50 [21:15:21] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:16:28] jhathaway: rzl: on otrs1001 it seems done.. queue is empty now. but mx1001 keeps growing .. what next [21:16:37] -qf on mx1001 now? 
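The queue-draining loop at 21:10-21:16 uses exim's forced queue run: `-qf` attempts delivery of every queued message even when "Retry time not yet reached", while `exiqgrep -c` reports how many messages remain. A dry-run sketch (echoed only; these need a live exim):

```shell
#!/usr/bin/env bash
# Dry-run sketch of draining the local exim queue, as on otrs1001.
run() { printf '+ %s\n' "$*"; }

run sudo exim4 -qf                        # forced queue run, retry timers ignored
run sudo exiqgrep -c                      # count of messages still queued
run sudo tail -f /var/log/exim4/mainlog   # watch completions in another window
```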
[21:17:06] yeah probably, but the completion rate on otrs1001 seemed slow [21:17:09] (MXQueueHigh) firing: (2) MX host mx1001:9100 has many queued messages: 9665 #page - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueHigh [21:17:33] jhathaway: with just one thread [21:18:36] yeah, so I think we need to fix that before giving it 7000 mails [21:18:42] does it. exim4 -qf and in another window exiqgrep -c [21:18:48] doesn't do it [21:18:53] :) [21:19:30] well, then, let's see if it still crashes with the original setting [21:19:37] yeah sounds good [21:20:01] !log otrs1001 - re-enabling puppet, running puppet [21:20:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:35] load is 60 [21:21:41] even just re-enabling puppet takes ..forever [21:21:43] box is pretty unusable [21:22:09] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:23:19] I think we should try giving it more ram, or is that not an option mutante? [21:24:15] well, it needs a reboot of the machine [21:24:41] and when changing hardware and then rebooting.. often the NIC goes away [21:25:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:25:04] if there is no other way..
we have to do that [21:25:22] it's scary though to reboot the only server there is [21:25:38] I have never rebooted the otrs box, so I agree with your judgement [21:26:42] arnoldokoth: opinions on the reboot? has it been done [21:26:47] ? [21:27:01] it's not doing us much good anyway if it can't receive email though, right? :) [21:27:50] true, uptime 175 days, which is a bit scary [21:28:05] it's not just a mail server, it's a full application with web UI and users [21:28:06] Yeah, not recently. [21:28:07] (I also need to duck into a meeting in 2m or so -- I'll defer to mutante and arnoldokoth's judgment, I don't have anything brilliant to add) [21:28:22] rzl: thanks for the help [21:28:31] hmm having a look through the headers of the queued mails to info@ on mx1001 looks like a possible influx of backscatter? [21:28:47] back in 30m or so if you still need me [21:29:19] we could do an inplace reboot first, and if that works consider bumping ram? [21:29:44] I will just revert my config change manually and give it the full threads again, see if it still crashes after local queue has been drained [21:30:00] okay [21:32:33] are we sure these are legitimate messages being processed by otrs? 
[21:33:15] no [21:33:22] !log mstyles@deploy1002 Synchronized php-1.40.0-wmf.5/extensions/CheckUser/src/Api/ApiQueryCheckUser.php: (no justification provided) (duration: 03m 37s) [21:33:23] at least I haven't looked at them yet [21:33:45] if they are spamming lets drop'em [21:33:50] !log otrs1001 - after local exim queue has been drained, set MaxThreads for clamav to 12 again, restarted clamav [21:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:41] I'm seeing a lot of undeliverable messages, like someone spoofed a from address and bounces are coming back to info@ [21:34:48] but please check me on that [21:34:49] that made the swapping behavior stop [21:34:55] "After a special procedure, we are happy to inform you that you are one of 3 people in the final draw for this prize from Mini Cooper" [21:35:00] something did anyways, but I assume the timing was the clamav change [21:35:18] so looks like a bunch of spam [21:35:45] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:35:58] we have 12 threads again and it's still alive [21:36:07] clamd's memory usage is high [21:36:08] now if we could send _some_ mails from mx1001 over? [21:36:55] do we dare to just -qf on mx1001 now? 
[21:37:48] ugh nevermind, swapping is back again [21:38:07] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/843530 [21:38:39] mutante: I think we should drop the spam first [21:38:59] that's what the daemon on otrs1001 is for ..to decide what is spam [21:39:00] yeah there's a bunch of spam in the local queue on otrs1001 [21:39:09] I don't think we can check each mail manually [21:39:48] the local queue on otrs1001 was already empty [21:39:55] why does it have 720 again :( [21:40:59] clamav is running and the queue is still growing, which is a new type of failure now :( [21:41:30] I'm going to see if I can script out killing the spam... [21:42:12] https://ticket.wikimedia.org/ is taking longer than usual to load on my end (timed out twice). Could we just bite the bullet? Take it down and increase the resources and see if that helps. The system might actually be unusable at this point for the volunteers at least. [21:42:31] they're delivery failures I think [21:43:16] either way, we could simply kill all the messages that have messagelabs.com as the sender? [21:43:44] which itself seems to be an anti-spam service heh [21:44:19] is it possibly a loop of delivery failures between us and them, due to an email reflection? [21:44:31] bblack: yea, I think that's a good idea. trying to count with exiqgrep ..it's just all so slow.. and now it says only 5 matches [21:45:58] exiqgrep -f messagelabs -c [21:45:59] 5 matches out of 922 messages [21:46:20] I have something to kill the mini cooper spam [21:46:24] hmm yeah [21:46:37] the messagelabs ones are just bounces [21:47:05] jhathaway: cool! do it [21:47:10] it seems like basically our otrs stuff is forwarding spam to a bunch of real recipients, and we're getting all the bounces back from their spam systems [21:47:11] https://privatebin.net/?3f4757a36a25ed7e#ERscssPfAEFATSybkvwJnhB39sKufq19H3JARZ7z8TwL [21:47:44] try it!
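[Editor's note] The actual cleanup script was pasted on privatebin and is not reproduced here; the sketch below is a hypothetical reconstruction of its likely shape. `exim -Mvb` (dump a message body) and `exim -Mrm` (remove a message) are real exim flags, but both binaries are mocked here so the loop can run standalone; the queue IDs and body text are invented:

```shell
# Mocked exiqgrep: pretend the queue holds two messages.
exiqgrep() { printf '1aaa-000001-AB\n1bbb-000002-CD\n'; }
# Mocked exim: -Mvb dumps a (fake) body, -Mrm reports the removal.
exim() {
  case "$1" in
    -Mvb) echo "you are one of 3 people in the final draw ... Mini Cooper" ;;
    -Mrm) shift; echo "removed $*" ;;
  esac
}
# Drop every queued message whose body matches the spam pattern.
for id in $(exiqgrep -i); do
  exim -Mvb "$id" | grep -q 'Mini Cooper' && exim -Mrm "$id"
done
```

As noted in the log, body scans like this are slow: `-Mvb` has to read each message off disk, which is why the header-only matches discussed below go faster.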
[21:48:41] it will catch the bounces too, since they quote the cooper spam, but that's probably a good thing [21:49:51] nice jhathaway, I think we could run that matching mails with From: (U|u)ndeliver.* and From: Mail Delivery Failure as well [21:50:12] ok [21:50:44] still running, its pretty slow [21:52:14] the non-body matches might be able to go faster [21:52:44] yeah good point [21:53:16] maybe we could stop the actual otrs service, the perl processes [21:53:43] doh s/From/Subject ... been staring at the screen too long [21:56:23] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:56:43] awww..,maan [21:57:08] starting that once again [21:57:42] so, to increase mem, we have to reboot the instance? [21:57:45] Ignoring deprecated option DetectBrokenExecutables at /etc/clamav/clamd.conf:38 [21:57:49] yea [21:58:19] PROBLEM - SSH on wdqs2005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:58:41] bblack: we think so [21:58:43] see all the perl processes run by the otrs user [21:58:51] maybe we should stop that part [22:00:53] I assume those are behind apache and drive the web UI [22:00:55] wait, something is improving, load going down [22:00:59] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:01:20] out of meeting, sorry about that, reading back [22:01:52] yeah load is getting much better [22:01:59] rzl: not much has changed, currently dropping the mini cooper spam, only about 800msgs left [22:02:01] https://www.irccloud.com/pastebin/JtEQjk8J/ [22:02:09] (MXQueueHigh) firing: (2) MX host mx1001:9100 has many queued messages: 4424 #page - https://wikitech.wikimedia.org/wiki/Exim 
- https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueHigh [22:02:10] load went down massively, started clamav again though [22:02:16] okay mini cooper spam has been dropped [22:02:23] cool [22:02:49] queue size larger than before we started though [22:03:07] that's ok as long as load's ok and it moves in the right direction [22:03:17] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:03:42] there's still ~1K emails in there that seem "stable" (probably delivery failures waiting on retries?) [22:06:38] caught up -- need another pair of hands on anything? [22:06:42] swapping and load just as bad as before now.. uhmpf [22:07:17] yeah it's just cyclic maybe [22:08:04] in exiqgrep terms, a lot of these are: -f '<>' [22:08:25] is that a reliable signal that it's spam to be dumped (and/or failure reports related)? [22:08:33] I wonder what happens if all the otrs perl processes are stopped. is it just no web UI then.. but mails would still be delivered to the queues and be seen once it gets started again [22:08:51] bblack: I think that is pretty reliable [22:08:59] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:09:05] I am dropping the same messages on otrs now, the mini cooper ones [22:09:28] then we don't even need the body-scanning command, just pipe exiqgrep -if '<>' into exim -Mrm ? [22:10:06] arnoldokoth: as you pointed out.. I should re-revert the "12 threads for clamav" manual change [22:10:09] okay done, only 30 in queue now [22:10:16] very nice [22:11:08] disabling puppet again on otrs1001 [22:11:41] load is down to 3 [22:12:07] lowering MaxThreads for clamav from 12 to ... 2 ..
if that still makes sense now [22:12:32] yeah I'm wondering if 1 or 2 threads is a reasonable long term state for otrs clamd config [22:12:32] unwinding temp things seems like a good plan! [22:13:03] well, it would have been re-adding the temp thing [22:13:07] ah I see [22:13:15] to make sure clamav does not get killed due to OOM [22:13:41] the fear on reboot is the nic name will randomly change and need fixup in /e/n/i and/or puppetization? [22:13:43] I have edited the config file but not restarted the service [22:14:12] bblack: yes, and just general concern that the box hasn't been reboot in 175 days [22:14:20] rebooted [22:14:20] I think better to have the mail queue at the sending mx and get processed eventually than knock clamd over and need manual intervention. would be good to filter some obvious bounce messages too [22:14:21] ok [22:14:24] yea, that and in general that there is no failover for it whatsoever and we can't even login [22:14:29] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:15:28] load is high again [22:15:30] 35 [22:15:42] clamav oomed [22:15:47] lol [22:16:15] the worst part is the swapping dragging everything down [22:16:28] started it.. and this time it should be just 2 threads [22:16:28] yeah [22:16:36] it was stable-ish at 1 thread yea? [22:17:05] can I just disable swap on the host? then maybe oom crashing will just happen faster with less impact [22:17:37] I don't think so, it is using 3G of swap [22:17:43] herron: after the initial success when the queue was completely empty, we had upped it to 12 again. yes, 1 was stable but flow [22:17:46] slow [22:17:55] i.e. I don't think this box works at all without swap at present [22:18:16] active+inactive currently barely fits [22:18:46] it's hard to make sense of, hmmm [22:19:08] clamd config file has this: ExitOnOOM false [22:19:10] hah [22:19:34] so we are back on 1 thread?
[22:19:59] jhathaway: yea, now we are [22:20:18] where does the config live out of curiosity? [22:20:34] /etc/clamav/clamd.conf [22:20:37] thanks [22:20:43] there are various options what to scan and what not to scan [22:21:34] yeah ~1.2GB swapped out now [22:21:37] :/ [22:21:41] load still at 50+ should we consider other options? [22:22:43] watching exiqgrep -ir 'info@wikipedia.org' | wc -l what a rollercoaster ride [22:23:04] what are the other options though? stop apache2? [22:23:25] mutante: not sure yet :) [22:23:45] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:24:32] freshclam doesn't have to run, that's updates for clam [22:25:10] the underlying ganeti has room to try doubling up the memory size, if we want to go down that road [22:26:02] bblack: do you know how much room we try to leave for the ganeti instance itself? [22:26:13] I saw it had 16Gs at present [22:26:42] load is back under 5 [22:26:55] according to gnt-node, it has 18.1G available to allocate to instances on ganeti1013 [22:27:15] I'm not exactly sure on the semantics, but seems safe to change this one node from 4G -> 8G [22:27:24] nod [22:27:33] PROBLEM - clamd running on otrs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [22:27:53] which would be past the total of mem+swap we have now, so might be enough to avoid some of the slowdown happening when it swaps [22:28:11] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:28:24] the only real downside is the various risks we create new functional issues by rebooting [22:28:27] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:28:32] but we can probably triage that [22:28:38] bblack: yeah [22:28:43] mutante: thoughts? [22:28:54] I don't like it but I have no better idea at this point. [22:29:09] arnoldokoth checked that the console works..which is good! [22:29:23] arnoldokoth: were you able to login as root? [22:29:37] at the ganeti console [22:29:50] Yup. [22:29:53] RECOVERY - clamd running on otrs1001 is OK: PROCS OK: 1 process with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [22:30:08] well.. then I guess we need to do that and give it the extra RAM [22:30:26] my vote would be for a no-change reboot first [22:30:57] just to verify we can bring it back up, unless folks are comfortable with the ganeti commands [22:31:16] my ganeti foo is weak [22:31:22] the ganeti command is not that special [22:31:31] we are concerned that the NIC changes numbers [22:31:37] ok [22:31:38] but we can fix that if it happens [22:31:50] the rest of the concern is just like "on every reboot" [22:31:56] so maybe 2 of them doubles that risk :p [22:32:02] :) [22:32:14] memory size shouldn't introduce the device numbering issue like adding a disk does, but I'd better knock on wood [22:32:59] okay someone want to run the ganeti command and reboot the box? [22:33:29] 🥶 [22:33:30] I can take a shot [22:33:41] ok, +1, bblack [22:34:07] !log ganeti1027: executing gnt-instance modify -B maxmem=8192 -B memory=8192 otrs1001.eqiad.wmnet [22:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:40] never used maxmem, only memory, interesting [22:34:57] as an aside is kvm the same as xen in that maxmem sets the ceiling that you can live resize up/down from? 
[22:35:03] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:35:47] mutante: I have no idea what I'm doing, I'm just reading docs and trying things :) [22:35:53] maybe -m alone was enough! [22:35:57] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on otrs1001.eqiad.wmnet with reason: reboot [22:36:12] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on otrs1001.eqiad.wmnet with reason: reboot [22:36:23] !log ganeti1027 - gnt-instance reboot otrs1001.eqiad.wmnet [22:36:25] downtime added (icinga + alertmanager) [22:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:47] arnoldokoth: watch it boot?:) [22:36:50] good thinking :) [22:37:13] responding to ping again [22:37:14] back up! [22:37:18] watching... [22:37:19] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:37:20] what.. too fast :p [22:37:45] interface name stayed stable [22:37:45] https://ticket.wikimedia.org/otrs/index.pl shows content [22:37:53] good! [22:37:56] memory change did what it should've done [22:38:14] MiB Mem : 7978.3 total [22:38:15] first agent run is going [22:38:17] look at all the extra space for activities [22:39:31] watch exiqgrep -c number goes down [22:39:41] looking at general icinga status too, in case some things didn't come back right [22:39:55] shall we bump the threads or wait some more? 
[22:39:55] (I already re-scheduled checks in case that's some of the red/purple there) [22:40:11] CRITICAL - degraded: The following units failed: ifup@ens13.service this is probably a race [22:40:17] I have seen it before [22:40:26] no spamassassin running either [22:40:52] which is disabled on the host, so that's odd [22:41:06] !log otrs1001 - systemctl reset-failed (clear alert for ifup@ens13.service) [22:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:01] clamav-daemon looks alright [22:42:16] probably could have used a few more cores in retrospect [22:43:02] yeah but icinga is complaining there's no process 'spamd' in the process table, which is true [22:43:34] ACK.. looking at the puppet part [22:44:13] mutante: puppet still disabled... could it be related? [22:44:22] arnoldokoth: shouldn't be [22:44:34] re-enabling puppet [22:44:34] ah yeah, that might help a lot [22:44:38] disabling puppet, just prevents future puppet runs [22:44:39] run the agent too [22:44:42] running puppet [22:44:58] Notice: /Stage[main]/Spamassassin/Service[spamassassin]/ensure: ensure changed 'stopped' to 'running' (corrective) [22:45:02] it's often the case that some things on some clusters/hosts only work once the agent runs, after a reboot [22:45:05] Notice: /Stage[main]/Clamav/File[/etc/clamav/clamd.conf]/content: content changed [22:45:12] * jhathaway hides [22:45:14] we could have a debate about whether that's a good thing :) [22:45:17] Notice: /Stage[main]/Exim4/File[/var/spool/exim4/db]/group: group changed 'root' to 'Debian-exim' (corrective) [22:45:21] ^ eh [22:45:33] that last one I did not expect [22:45:55] PROBLEM - BFD status on cr2-drmrs is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:46:03] mail queue is empty too [22:46:03] [/var/spool/exim4/scan]/owner: owner [22:46:18] what's with all the background BFD alerts btw? [22:46:21] ok, and NOw we tell mx1001 to -qf ? 
[22:46:22] spamassassin should be fixed, weird to have a disabled service that puppet starts [22:46:52] mutante: we could, it will clear it out in due course [22:46:53] mutante: the clamav stuff is now at 12 threads again? [22:47:19] the config is for sure.. ehm [22:47:30] also I wonder if exim needs a restart after that perms change [22:47:43] it should be, as long as the puppet refresh was enough: Scheduling refresh of Service[clamav-daemon] [22:47:53] let's restart both I guess [22:48:10] go ahead! [22:48:11] RECOVERY - BFD status on cr2-drmrs is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:48:29] done. restarted clamav and exim4 [22:48:41] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:48:45] things look pretty good now [22:48:54] what else to clean up? [22:48:55] "paniclog" is non-empty [22:48:59] that one :) [22:49:11] it will send nagging mails about that too [22:49:12] shall we flush the remaining messages to info@ from the mxes? [22:49:20] herron: yea, I think so [22:49:27] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [22:49:36] yeah [22:49:54] I am removing the contents from paniclog [22:49:57] ack [22:49:59] it was full of "cant connect to clam" [22:50:19] done [22:50:47] ok, who wants to run the command on mx1001. exim4 -qf ? [22:51:10] 2022-10-17 22:51:03 Warning: No server certificate defined; will use a selfsigned one. [22:51:15] is that normal on the otrs exim?
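[Editor's note] Emptying the paniclog as done above is safest as an in-place truncation, so that any process holding the file open keeps a valid handle. A sketch with a stand-in path (the real file is exim4's paniclog, assumed here to live at the Debian default /var/log/exim4/paniclog):

```shell
# Stand-in for /var/log/exim4/paniclog, pre-filled with a clamd error
# like the ones seen in the log.
log=/tmp/paniclog
echo "Cannot connect to clamd" > "$log"
# Truncate in place instead of deleting: the file (and any open
# descriptors on it) survives, but the nagging content is gone.
: > "$log"
wc -c < "$log"
```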
[22:51:26] queue size is already shrinking on mx :) [22:51:44] [apparently it is, it's in old logfiles too] [22:51:48] bblack: I saw that before the reboot as well, not sure [22:51:57] mutante: running [22:52:18] yeah seems to be making progress. loadavg is going up too [22:52:59] finished on mx1001 [22:53:03] used up all the new mem too, swapping a little :/ [22:53:30] arnoldokoth: you were right about "biting the bullet". you said it earlier. I was just too paranoid about that reboot [22:53:31] at least it's not hurting as bad [22:53:46] yeah at least no ooms [22:53:46] (because.. things that happened before) [22:53:48] jhathaway: did you already run the spam cleanup on mx2001? [22:53:54] herron: no [22:53:57] queue is shrinking on otrs1001 now [22:53:59] will do... [22:54:09] ah ok, that explains why the queue is so spicy [22:54:17] 11k messages [22:54:29] thx [22:54:50] memory freed up a bit after its initial burst of activity too, so it got through the rough patch relatively-ok [22:55:01] !log dzahn@cumin2002 START - Cookbook sre.hosts.remove-downtime for otrs1001.eqiad.wmnet [22:55:02] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for otrs1001.eqiad.wmnet [22:55:05] herron: running [22:55:17] first time using "remove-downtime" cookbook. it was all green again and now we want to see new issues [22:55:17] one of these various mail daemons needs better internal resource limits or something [22:55:31] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:57:21] mutante: hehe. we got lucky. 
:) [22:57:27] bblack: +1 [22:57:35] arnoldokoth: we better keep 8GB RAM by default [22:58:01] it seems to go through waves now, where it eats up all the 8G of RAM and pushes into swap by a bit, then eventually frees up [22:58:15] seems to coincide with clamd operations, but not 100% sure [22:58:51] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/git-ssh is broken daniel_zahn git-ssh decom side effect to be solved https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:58:51] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/git-ssh is broken daniel_zahn git-ssh decom side effect to be solved https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:58:51] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/git-ssh is broken daniel_zahn git-ssh decom side effect to be solved https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:58:51] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/git-ssh is broken daniel_zahn git-ssh decom side effect to be solved https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:59:13] unrelated noise I was asking about before this incident started [22:59:29] "perl /opt/otrs/bin/otrs.Console.pl Maint::PostMaster::Read" what is this step? 
[23:00:41] I think it's picking up mail from exim and putting the results in otrs tickets, or vice-versa [23:00:56] probably the former [23:01:10] okay thanks, I assumed so, but wasn't sure [23:01:11] I think that's the main process that reads incoming mail [23:01:18] I'm going to step away fro a bit to help with dinner and bedtime, looks like things are stabilizing but feel free to ping if needed [23:01:21] $Self->Description('Read incoming email from STDIN.'); [23:01:35] thanks herron [23:01:47] mutante: those ACKs, aren't those the .err file thing again? [23:01:58] https://fossies.org/linux/Znuny/Kernel/System/Console/Command/Maint/PostMaster/Read.pm [23:02:01] https://usercontent.irccloud-cdn.com/file/h4PWjcC2/Screenshot%202022-10-18%20at%2002.01.13.png [23:02:59] bblack: unfortunately they seem slightly different. one is .err files in /var/run/confd-template but this is under /etc [23:03:08] and they don't go away with a revert [23:03:30] arnoldokoth: thanks [23:03:44] /var/run/confd-template is empty on both puppet masters [23:04:25] mutante: this time are we completely removing git-ssh from all config? [23:04:42] bblack: I wanted to but I gave up on it and reverted my change [23:04:49] oh ok [23:04:53] I would like to [23:05:05] but I had like 12 alerts [23:05:10] should we give it some more cores? [23:05:10] when I removed from conftool-data [23:05:47] and then there was the question whether I delete it also from common/service.yaml at the same time or not [23:05:47] jhathaway: it seems like its general behavior is to be a resource black hole. 
if it "works" and stops misbehaving as it is, I'd say leave it [23:06:05] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:06:17] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:06:18] yeah, I just assume with its current load of 77 it is not working? [23:06:37] give me a break @ lists [23:06:53] loadavg isn't necessarily terrible on its own [23:06:59] is there a functional problem still? [23:07:09] (MXQueueHigh) resolved: MX host mx2001:9100 has many queued messages: 4623 #page - https://wikitech.wikimedia.org/wiki/Exim - https://grafana.wikimedia.org/d/000000451/mail - https://alerts.wikimedia.org/?q=alertname%3DMXQueueHigh [23:07:10] true good question [23:07:39] I think now we just care about getting mx queue under the limit again [23:07:43] arnoldokoth: do you have a sense if the system is working? 
[23:07:46] the queue size [on otrs1001] has bounced around a bit, but I haven't seen it go over 100 [23:08:20] (ConfdResourceFailed) firing: (4) confd resource _srv_config-master_pybal_codfw_git-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [23:08:57] and why jinxer-wm is suddenly talking about this I have no idea [23:09:27] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [23:09:31] I assume it's a mirror of the icinga alert, roughly [23:09:36] ok all the mini cooper spam is gone from mx2001 [23:11:06] queue is 178 on otrs [23:11:22] thanks jhathaway, great [23:11:39] well not so great, as it seems to be climbing again [23:11:53] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:12:07] mutante: the .err files are present for git-ssh, but they seem to still be creating new ones [23:12:57] well we seem to be falling behind again, so we need to either kill more spam, or give more resources, or? [23:13:06] jhathaway: site loads fine but saw lots of "Mail Delivery Failure" which have now turned into a variety of different things. though I'm not sure if I'm looking at the same thing I was when I initially saw that. [23:13:24] mutante: underlying issue is confd templating stuff is still complaining with: [invalid]: server pool cannot be empty! [23:13:32] bblack: yea, so going forward or backward.. you always get some type of alert. that made me give up and I already knew that revert likely would also not fix it :( [23:13:35] maybe not all things are reverted? 
[23:14:11] {"phab2001-vcs.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=phabricator,service=git-ssh"} [23:14:14] {"phab1001-vcs.eqiad.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=phabricator,service=git-ssh"} [23:14:27] ^ or maybe after reversion, you also need to set non-zero weight for these and pool them [23:15:04] (and then probably clear those .err files again after that's cleaned up) [23:15:13] I had followed the docs that told me to run a confctl-decom command, then pooled them again. [23:15:30] deleted err files like 10 times. I'll do it again [23:15:52] but now we're re-comming :) [23:16:22] I'll fix weight+pooled status for now [23:16:27] !log bblack@puppetmaster2001 conftool action : set/weight=100; selector: service=git-ssh [23:16:31] yea, did that before as well. somehow they are inactive again [23:16:34] !log bblack@puppetmaster2001 conftool action : set/pooled=yes; selector: service=git-ssh [23:16:47] it will do that every time they're freshly added to etcd [23:17:23] ok, thanks. I think I give up on decom'ing an LVS service. [23:17:31] now we get: Oct 17 23:17:03 puppetmaster2001 confd[27714]: 2022-10-17T23:17:03Z puppetmaster2001 /usr/bin/confd[27714]: INFO Target config /srv/config-master/pybal/eqiad/git-ssh has been updated [23:17:39] I am going to duck away and make dinner, then see how we are doing, that okay with folks? [23:17:40] so I think now just clear the .err one last time on both [23:17:46] or do we want to make other changes? 
[23:17:56] it seems stable-ish [23:18:10] queue dropped back tiny again on otrs1001 [23:18:24] yeah, not sure I understand the present behavior [23:18:58] I don't see the .err files in /var/run/confd-template/ where I deleted them many times before [23:19:37] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.275 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:19:41] jhathaway: I wouldn't know what other change, go! [23:19:43] mutante: they're there on both puppetmaster[12]001 [23:19:47] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48828 bytes in 0.153 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:19:54] ok, I'll check back in a bit [23:20:08] root@puppetmaster1001:/var/run/confd-template# ls -lrta|tail [23:20:08] -rw-r--r-- 1 root root 0 Oct 17 23:14 .git-ssh736876848.err [23:20:09] -rw-r--r-- 1 root root 0 Oct 17 23:15 .git-ssh738147531.err [23:20:11] [...] [23:20:27] ooh.. my.. ACK [23:20:34] but it stopped making new ones a few minutes ago, when I pooled+weighted those entries [23:20:51] done. deleted. my brain is fried [23:20:51] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:21:05] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [23:21:05] thanks bblack [23:21:05] RECOVERY - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [23:21:09] np! [23:21:13] glad that mailman stuff solved itself.. 
[23:21:20] I think it's "normal" too :p
[23:21:21] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[23:21:21] RECOVERY - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[23:22:01] it was simply that I ran "ls" and they are hidden files starting with . of course
[23:22:02] someone should make a ticket about making that error state tracking system clean up after itself automagically or something :P
[23:22:21] RECOVERY - PyBal IPVS diff check on lvs1018 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[23:22:21] RECOVERY - PyBal IPVS diff check on lvs2008 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[23:22:21] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[23:22:23] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[23:22:43] did you do that with a pybal restart ^ ?
[23:22:52] no, I don't think it's needed
[23:22:58] was just fallout of having all entries depooled
[23:22:59] I was also convinced I need one
[23:23:04] oh man
[23:23:06] (ConfdResourceFailed) resolved: (4) confd resource _srv_config-master_pybal_codfw_git-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[23:23:27] and ipvsadm delete or something :p which I did not want to do
[23:23:33] if we try to decom again, *that* may require a set of restarts and ipvsadm manual removals
[23:23:49] (to completely remove the service)
[23:24:05] yea, ACK.
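[Editor's sketch] The manual IPVS cleanup bblack mentions at 23:23:33 would look roughly like the function below: once the service is gone from pybal's config, the kernel virtual server entry can linger until deleted with `ipvsadm -D`. The VIP:port argument is a hypothetical placeholder (the log never states one), and `IPVSADM` is overridable (e.g. `IPVSADM=echo`) for a dry run.

```shell
# Sketch of the "ipvsadm manual removals" step of a full LVS service decom.
# Assumes a TCP service; the address passed in is illustrative only.
IPVSADM="${IPVSADM:-ipvsadm}"
remove_ipvs_service() {
    vip_port="$1"                  # e.g. "10.2.2.1:22" (made-up example)
    "$IPVSADM" -D -t "$vip_port"   # delete the TCP virtual service entry
}
```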
but then when I wanted to revert it felt like it _also_ needs a pybal restart.
[23:24:11] not true then
[23:24:14] I have to leave for dinner in a few minutes though, so I'm out for today :)
[23:24:29] yea, same. thanks for the help
[23:24:30] ping me tomorrow if you want to tackle it again
[23:24:35] ok!
[23:25:56] incident is technically over. the reason that it p.ages is gone, since mx is under the threshold
[23:29:45] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:34:37] RECOVERY - SSH on db1116.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:36:11] PROBLEM - BFD status on cr2-drmrs is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:38:27] RECOVERY - BFD status on cr2-drmrs is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:41:07] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:42:22] (03CR) 10Legoktm: [C: 03+1] bullseye: add bzip2 and zstd compression programs [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/842992 (https://phabricator.wikimedia.org/T294607) (owner: 10BryanDavis)
[23:44:13] (03CR) 10Legoktm: [C: 03+1] "My only question is whether we foresee any need to version this image."
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/842993 (https://phabricator.wikimedia.org/T254636) (owner: 10BryanDavis)
[23:47:59] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:50:13] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:55:08] (03CR) 10BryanDavis: mysql: new image for mysql backups (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/842993 (https://phabricator.wikimedia.org/T254636) (owner: 10BryanDavis)
[23:56:49] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:59:05] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status