[00:00:45] (03PS21) 10Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) [00:01:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:02:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:04:19] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:09:19] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:11:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [00:16:19] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: drop_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:16:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [00:21:02] PROBLEM - MariaDB Replica Lag: s6 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1251.58 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:21:38] PROBLEM - MariaDB Replica Lag: s7 on db2100 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1288.17 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:33:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [00:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [00:42:52] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:17:46] PROBLEM - Check systemd state on dbprov2001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:22:04] RECOVERY - MariaDB Replica Lag: s6 on db2141 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:22:42] RECOVERY - MariaDB Replica Lag: s7 on db2100 is OK: OK slave_sql_lag Replication lag: 0.22 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:33:14] PROBLEM - Check systemd state on dbprov2002 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:37:45] (JobUnavailable) firing: (6) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221011T0200) [02:03:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:04:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:04:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:05:29] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:07:42] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.40.0-wmf.5 [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/840575 (https://phabricator.wikimedia.org/T314194) [02:07:44] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.40.0-wmf.5 [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/840575 (https://phabricator.wikimedia.org/T314194) (owner: 10TrainBranchBot) [02:07:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:12:04] RECOVERY - Check systemd state on dbprov2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:20:02] (03CR) 10CI reject: [V: 04-1] Branch commit for wmf/1.40.0-wmf.5 [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/840575 (https://phabricator.wikimedia.org/T314194) (owner: 10TrainBranchBot) [02:27:30] RECOVERY - Check systemd state on dbprov2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:31:18] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:31:38] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:37:58] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.264 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:38:18] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48827 bytes in 0.104 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221011T0300) [03:06:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [03:07:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [03:07:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [03:09:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [03:12:38] PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: generate_otrs_aliases.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:33:06] RECOVERY - dump of matomo in eqiad on backupmon1001 is OK: Last dump for matomo at eqiad (db1108) taken on 2022-10-11 03:21:25 (1.2 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [03:51:06] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:08:58] RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:09:34] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [04:33:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - 
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [04:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [04:52:16] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:00:05] kormat, marostegui, and Amir1: Dear deployers, time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221011T0600). [06:10:33] (03PS2) 10KartikMistry: ContentTranslation: Make Mongolian Wikipedia MT stricter by 10% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839411 (https://phabricator.wikimedia.org/T319156) [06:34:14] !log kill leftover process of bmansurov on an-airflow1002 to allow user cleanup via puppet [06:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:58] !log delete now unused VC ports on asw2-c4-eqiad - T313384 [06:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:02] T313384: eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 [06:37:42] !log kill leftover process of bmansurov on stat1007 to allow user cleanup via puppet [06:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:52] RECOVERY - Juniper virtual chassis ports on asw2-c-eqiad is OK: OK: UP: 20 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [06:43:15] !log kill leftover process of nokafor on stat1004 to allow user cleanup via puppet [06:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[06:44:51] !log kill leftover process of jmads on stat1005 to allow user cleanup via puppet [06:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:01] (03CR) 10Santhosh: [C: 03+2] ContentTranslation: Make Mongolian Wikipedia MT stricter by 10% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839411 (https://phabricator.wikimedia.org/T319156) (owner: 10KartikMistry) [06:46:07] (03Merged) 10jenkins-bot: ContentTranslation: Make Mongolian Wikipedia MT stricter by 10% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839411 (https://phabricator.wikimedia.org/T319156) (owner: 10KartikMistry) [06:52:04] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review, 10Sustainability (Incident Followup): Rate limiting for hotlinked images - https://phabricator.wikimedia.org/T317799 (10ayounsi) [clinic duty] tagging the teams I think are relevant to this task, please change the tags as needed [06:52:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [06:52:35] I'll deploy 839411, as it was +2'ed by mistake. Few minutes to go for Backport deployment window.. 
[06:53:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [06:53:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [06:53:51] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, and 2 others: rsync puppet module doesn't delete removed config - https://phabricator.wikimedia.org/T205618 (10ayounsi) [clinic duty] tagging the teams I think are relevant to this task, please change the tags as needed [06:54:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [06:54:40] 10ops-eqiad, 10Data-Engineering: Check analytics1086's mgmt's cable - https://phabricator.wikimedia.org/T320458 (10elukey) [06:54:53] 10ops-eqiad, 10Data-Engineering: Check analytics1086 mgmt's cable - https://phabricator.wikimedia.org/T320458 (10elukey) [06:57:48] (03PS1) 10Muehlenhoff: Remove LDAP access for aassaf [puppet] - 10https://gerrit.wikimedia.org/r/841389 [07:00:03] (03CR) 10Muehlenhoff: [C: 03+2] Remove LDAP access for aassaf [puppet] - 10https://gerrit.wikimedia.org/r/841389 (owner: 10Muehlenhoff) [07:00:05] Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221011T0700). [07:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:26] * kart_ is here [07:00:33] will self deploy. Minor change. 
[07:00:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839411 (https://phabricator.wikimedia.org/T319156) (owner: 10KartikMistry) [07:01:02] !log kartik@deploy1002 Started scap: Backport for [[gerrit:839411|ContentTranslation: Make Mongolian Wikipedia MT stricter by 10% (T319156)]] [07:01:06] T319156: Make Mongolian Wikipedia Machine Translation stricter by 10% - https://phabricator.wikimedia.org/T319156 [07:01:59] !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:839411|ContentTranslation: Make Mongolian Wikipedia MT stricter by 10% (T319156)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [07:02:26] (03PS1) 10Muehlenhoff: Make ganeti4008 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/841390 (https://phabricator.wikimedia.org/T317247) [07:09:58] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:839411|ContentTranslation: Make Mongolian Wikipedia MT stricter by 10% (T319156)]] (duration: 08m 56s) [07:10:03] T319156: Make Mongolian Wikipedia Machine Translation stricter by 10% - https://phabricator.wikimedia.org/T319156 [07:10:54] I'm done. [07:11:07] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [07:12:21] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[07:13:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:15:48] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [07:16:32] !log [Elastic] Updated cross-cluster remote seeds (masters): `ryankemper@mwmaint1002:~/elastic$ python push_cross_cluster_conf.py https://search.svc.eqiad.wmnet:9[2,4,6]43/_cluster/settings --ccc chi=chi_eqiad_masters.lst psi=psi_eqiad_masters.lst omega=omega_eqiad_masters.lst` [07:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:05] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [07:17:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet [07:17:53] !log [Elastic] Forcing recheck of elastic settings check alerts; expecting a bit of noise as the alerts resolve (hopefully) [07:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:18:00] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [07:18:18] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
[07:18:41] RECOVERY - ElasticSearch setting check - 9200 on elastic1054 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [07:18:43] RECOVERY - ElasticSearch setting check - 9600 on elastic1075 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [07:18:43] RECOVERY - ElasticSearch setting check - 9600 on elastic1073 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [07:18:43] RECOVERY - ElasticSearch setting check - 9200 on elastic1074 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [07:18:43] RECOVERY - ElasticSearch setting check - 9200 on elastic1081 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [07:18:45] RECOVERY - ElasticSearch setting check - 9600 on elastic1083 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [07:18:45] RECOVERY - ElasticSearch setting check - 9200 on elastic1094 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [07:18:45] RECOVERY - ElasticSearch setting check - 9600 on elastic1095 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [07:18:47] RECOVERY - ElasticSearch setting check - 9200 on elastic1100 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [07:18:47] RECOVERY - ElasticSearch setting check - 9600 on elastic1102 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [07:18:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:21:05] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . 
[07:21:35] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [07:22:26] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [07:22:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:24:15] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [07:30:48] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [07:31:48] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [07:32:59] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:34:55] (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [07:39:55] (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [07:40:52] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[07:41:23] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [07:46:38] (03Abandoned) 10Hashar: POST events asynchronously [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/816115 (owner: 10Hashar) [07:52:45] (03PS1) 10Muehlenhoff: Add Michael Schönitzer to contributors [puppet] - 10https://gerrit.wikimedia.org/r/841446 (https://phabricator.wikimedia.org/T308013) [07:55:40] (03CR) 10Muehlenhoff: [C: 03+2] Add Michael Schönitzer to contributors [puppet] - 10https://gerrit.wikimedia.org/r/841446 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [07:55:59] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:03:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:09:34] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [08:19:48] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/841122 (owner: 10Jbond) [08:32:05] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one note inline." 
[software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/841112 (owner: 10Jbond) [08:33:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [08:35:33] (03PS1) 10Vgutierrez: trafficserver: Partition cache in one server per DC and cluster #3 [puppet] - 10https://gerrit.wikimedia.org/r/841451 (https://phabricator.wikimedia.org/T317748) [08:37:38] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ganeti4008.ulsfo.wmnet [08:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [08:38:51] (03CR) 10Muehlenhoff: "I fully trust your CSS/HTML expertise there :-) Could we capture that change in README.Debian as well, so that we are aware when we rebase" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/841181 (owner: 10Jbond) [08:41:36] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37493/console" [puppet] - 10https://gerrit.wikimedia.org/r/841451 (https://phabricator.wikimedia.org/T317748) (owner: 10Vgutierrez) [08:48:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:52:39] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] 
trafficserver: Partition cache in one server per DC and cluster #3 [puppet] - 10https://gerrit.wikimedia.org/r/841451 (https://phabricator.wikimedia.org/T317748) (owner: 10Vgutierrez) [08:53:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:53:08] !log partitioning the ATS cache in cp1085, cp1086, cp2037, cp2038, cp3060, cp3061, cp4026, cp4030, cp5006, cp5012, cp6005, cp6013 - T317748 [08:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:13] T317748: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748 [08:59:33] RECOVERY - puppet last run on bast1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:59:43] (03CR) 10Jbond: [V: 03+2 C: 03+2] cas: drop u2f support [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/841122 (owner: 10Jbond) [08:59:51] (03CR) 10Jbond: [V: 03+2 C: 03+2] 6.6.1: update files to prepare for 6.6.1 release [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/841112 (owner: 10Jbond) [09:04:24] (03PS3) 10Jbond: casLoginView.html: drop card properties [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/841181 [09:04:26] (03PS1) 10Jbond: build.gradle: add oidc support [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/841456 [09:05:21] (03PS1) 10Arturo Borrero Gonzalez: nftables: basefirewall: typo [puppet] - 10https://gerrit.wikimedia.org/r/841457 [09:08:26] (03PS4) 10Jbond: casLoginView.html: drop card properties [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/841181 
[09:09:19] (03PS2) 10Jbond: build.gradle: add oidc support [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/841456 [09:09:30] (03PS8) 10Vlad.shapik: Update the logic to run code coverage [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016) [09:09:59] (03CR) 10Jbond: casLoginView.html: drop card properties (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/841181 (owner: 10Jbond) [09:10:11] (03CR) 10Jbond: [V: 03+2 C: 03+2] casLoginView.html: Add original file from cas 6.6.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/841180 (owner: 10Jbond) [09:11:32] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37494/console" [puppet] - 10https://gerrit.wikimedia.org/r/841171 (https://phabricator.wikimedia.org/T317997) (owner: 10Dduvall) [09:19:40] (03CR) 10Hnowlan: [C: 03+1] maps: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/840139 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:28:25] (03CR) 10Jelto: [V: 03+1 C: 03+2] P:gitlab::runner: Quote environment variable hash keys [puppet] - 10https://gerrit.wikimedia.org/r/841171 (https://phabricator.wikimedia.org/T317997) (owner: 10Dduvall) [09:32:21] PROBLEM - MariaDB Replica Lag: s4 on db2110 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 62453.58 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:33:41] _joe_: Do you mind having a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/841148? [09:35:20] (03PS1) 10Vgutierrez: hieradata: Remove cp4031 hiera file [puppet] - 10https://gerrit.wikimedia.org/r/841458 (https://phabricator.wikimedia.org/T301269) [09:36:34] <_joe_> hoo: yup, is the maint script updated? [09:37:21] _joe_: Not yet... 
but is trivial to do (we'll make it accept, but ignore the options at first) [09:40:40] 10SRE, 10Infrastructure-Foundations: Bug in bridge-utils breaks IPv6 on interface if its not part of a bridge but vlan sub-int of it is - https://phabricator.wikimedia.org/T320429 (10aborrero) The Debian developer wanted to disable autogenerated IPv6 link local addresses on bridged interfaces. Instead of disa... [09:42:14] <_joe_> hoo: ok, so when the script doesn't error out with the parameters, ping me and we'll merge this [09:42:30] Nice, will take care of that :) [09:44:44] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1006.eqiad.wmnet with reason: Remove from cluster for decom [09:44:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1006.eqiad.wmnet with reason: Remove from cluster for decom [09:45:39] (03PS1) 10Ayounsi: Enable dhcp relay on ulsfo mgmt [homer/public] - 10https://gerrit.wikimedia.org/r/841460 (https://phabricator.wikimedia.org/T271583) [09:47:09] (03CR) 10Vgutierrez: [C: 03+2] hieradata: Remove cp4031 hiera file [puppet] - 10https://gerrit.wikimedia.org/r/841458 (https://phabricator.wikimedia.org/T301269) (owner: 10Vgutierrez) [09:49:19] PROBLEM - MD RAID on elastic2052 is CRITICAL: CRITICAL: State: degraded, Active: 5, Working: 5, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [09:49:20] ACKNOWLEDGEMENT - MD RAID on elastic2052 is CRITICAL: CRITICAL: State: degraded, Active: 5, Working: 5, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T320482 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [09:49:25] 10SRE, 10ops-codfw: Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 (10ops-monitoring-bot) [09:49:31] (03PS2) 10Ayounsi: Enable dhcp 
relay for mgmt network [homer/public] - 10https://gerrit.wikimedia.org/r/841460 (https://phabricator.wikimedia.org/T271583) [09:50:04] (03CR) 10CI reject: [V: 04-1] Enable dhcp relay for mgmt network [homer/public] - 10https://gerrit.wikimedia.org/r/841460 (https://phabricator.wikimedia.org/T271583) (owner: 10Ayounsi) [09:53:17] (03PS1) 10Muehlenhoff: Remove ganeti1006 [puppet] - 10https://gerrit.wikimedia.org/r/841461 (https://phabricator.wikimedia.org/T320419) [09:53:49] (03PS1) 10Kosta Harlan: Revert "Skins: Config flag controls contributions link" [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841160 (https://phabricator.wikimedia.org/T320471) [09:55:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:57:08] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED [09:57:22] 10SRE, 10Infrastructure-Foundations, 10netops: Default allowed SSH parameters on upgraded Juniper mgmt routers prevent some connections - https://phabricator.wikimedia.org/T320272 (10jbond) > , as we can drop to a regular shell and specify the MAC code manually: FYi you can also use the .ssh/config file whic... 
[10:00:44] (03PS1) 10JMeybohm: kubernetes::master calculate apiserver_count [puppet] - 10https://gerrit.wikimedia.org/r/841463 (https://phabricator.wikimedia.org/T307943) [10:00:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:02:56] (03CR) 10Muehlenhoff: [C: 03+2] Remove ganeti1006 [puppet] - 10https://gerrit.wikimedia.org/r/841461 (https://phabricator.wikimedia.org/T320419) (owner: 10Muehlenhoff) [10:02:58] !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED [10:06:02] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED [10:07:39] !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED [10:08:18] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED [10:11:18] (03PS2) 10JMeybohm: kubernetes::master remove apiserver_count [puppet] - 10https://gerrit.wikimedia.org/r/841463 (https://phabricator.wikimedia.org/T307943) [10:12:03] !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED [10:12:46] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37496/console" [puppet] - 10https://gerrit.wikimedia.org/r/841463 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [10:13:18] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED [10:22:08] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 
10Patch-For-Review: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277 (10jbond) I thought i would bring my response here. > Setting skip_acked will also skip recheck_failed_services() Regardless of if we call `... [10:26:19] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "With this patch we won't just disable pregeneration, but also disable cache invalidation. I would assume we'd need to just switch the meth" [deployment-charts] - 10https://gerrit.wikimedia.org/r/840097 (https://phabricator.wikimedia.org/T319365) (owner: 10Jgiannelos) [10:27:14] 10SRE-swift-storage, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-Phonos, 10Community-Tech (CommTech-Sprint-35), and 2 others: Phonos links to an unauthorized URL - https://phabricator.wikimedia.org/T317417 (10JMcLeod_WMF) [10:29:22] (03PS1) 10Muehlenhoff: sre.ganeti.changedisk: Correct RAPI call [cookbooks] - 10https://gerrit.wikimedia.org/r/841464 [10:32:56] 10SRE, 10Security-Team, 10LDAP: Audit the WMF LDAP group and limit its permissions - https://phabricator.wikimedia.org/T240870 (10Peachey88) [10:33:07] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Run helm dependency build before packaging [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/826859 (https://phabricator.wikimedia.org/T316347) (owner: 10JMeybohm) [10:35:54] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277 (10jbond) >>! In T319277#8307039, @jbond wrote: > I thought i would bring my response here. > >> Setting skip_acked will also skip recheck_f... 
[10:39:06] (03CR) 10Jbond: "removing -1 change seems fine to me however im not convinced this is the correct way to go, but have moved that comment to the task" [software/spicerack] - 10https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277) (owner: 10Slyngshede) [10:39:44] 10SRE, 10Sustainability (Incident Followup): Expand upon Kask/Sessionstore documentation - https://phabricator.wikimedia.org/T320398 (10hnowlan) Dashboard for all of the relevant metrics to [[ https://wikitech.wikimedia.org/wiki/Incidents/2022-09-15_sessionstore_quorum_issues | the incident ]] that triggered t... [10:40:57] (03CR) 10Majavah: [C: 03+2] Revert "Skins: Config flag controls contributions link" [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841160 (https://phabricator.wikimedia.org/T320471) (owner: 10Kosta Harlan) [10:41:10] !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED [10:42:16] (03PS2) 10Zabe: Branch commit for wmf/1.40.0-wmf.5 [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/840575 (https://phabricator.wikimedia.org/T314194) (owner: 10TrainBranchBot) [10:42:26] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/841145 (owner: 10Arturo Borrero Gonzalez) [10:43:27] (03CR) 10Majavah: [C: 03+2] Branch commit for wmf/1.40.0-wmf.5 [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/840575 (https://phabricator.wikimedia.org/T314194) (owner: 10TrainBranchBot) [10:44:02] (03PS1) 10Muehlenhoff: Make ganeti1032 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/841467 [10:44:46] (03CR) 10CI reject: [V: 04-1] Make ganeti1032 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/841467 (owner: 10Muehlenhoff) [10:48:16] (03PS2) 10Muehlenhoff: Make ganeti1032 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/841467 (https://phabricator.wikimedia.org/T299459) [10:53:29] PROBLEM - mailman3_queue_size 
on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 302 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [10:58:18] (03CR) 10Muehlenhoff: [C: 03+2] Make ganeti1032 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/841467 (https://phabricator.wikimedia.org/T299459) (owner: 10Muehlenhoff) [11:02:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudnet: merge host hiera overrides back into the profile [puppet] - 10https://gerrit.wikimedia.org/r/841145 (owner: 10Arturo Borrero Gonzalez) [11:03:29] (03Merged) 10jenkins-bot: Revert "Skins: Config flag controls contributions link" [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841160 (https://phabricator.wikimedia.org/T320471) (owner: 10Kosta Harlan) [11:03:49] (03Merged) 10jenkins-bot: Branch commit for wmf/1.40.0-wmf.5 [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/840575 (https://phabricator.wikimedia.org/T314194) (owner: 10TrainBranchBot) [11:05:49] RECOVERY - mailman3_queue_size on lists1001 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [11:10:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1032.eqiad.wmnet [11:10:55] (03PS1) 10Arturo Borrero Gonzalez: Revert "cloudnet: merge host hiera overrides back into the profile" [puppet] - 10https://gerrit.wikimedia.org/r/841161 [11:11:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [11:11:41] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Revert "cloudnet: merge host hiera overrides back into the profile" [puppet] - 10https://gerrit.wikimedia.org/r/841161 (owner: 10Arturo Borrero Gonzalez) [11:12:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:12:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START 
helmfile.d/services/mwdebug: apply [11:12:47] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10jbond) [11:13:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] nftables: basefirewall: typo [puppet] - 10https://gerrit.wikimedia.org/r/841457 (owner: 10Arturo Borrero Gonzalez) [11:13:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [11:15:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:19:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1032.eqiad.wmnet [11:19:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T314041)', diff saved to https://phabricator.wikimedia.org/P35394 and previous config saved to /var/cache/conftool/dbconfig/20221011-111954-ladsgroup.json [11:19:59] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [11:20:51] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277 (10Volans) I understand your concerns > Regardless of if we call wait_for_optimal(True) or wait_for_optimal(False) we should always call rec... 
[11:20:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:20:57] RECOVERY - MariaDB Replica Lag: s4 on db2110 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:21:20] (03CR) 10Volans: "Adding a couple of more comments, but let's see what we agree on in the task first." [software/spicerack] - 10https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277) (owner: 10Slyngshede) [11:21:31] (03PS9) 10Vlad.shapik: Update the logic to run code coverage [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016) [11:23:14] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/841180 (owner: 10Jbond) [11:26:05] (03Abandoned) 10Hokwelum: Add labstore1006 to dumps distribution servers [puppet] - 10https://gerrit.wikimedia.org/r/840158 (https://phabricator.wikimedia.org/T319269) (owner: 10Hokwelum) [11:26:14] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1032.eqiad.wmnet to cluster eqiad and group A [11:26:42] (03PS10) 10Vlad.shapik: Update the logic to run code coverage [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016) [11:27:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1032.eqiad.wmnet to cluster eqiad and group A [11:28:13] (03PS11) 10Vlad.shapik: Update the logic to run code coverage [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016) [11:29:25] 10SRE-swift-storage, 
10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-Phonos, 10Community-Tech (CommTech-Sprint-35), and 2 others: Phonos links to an unauthorized URL - https://phabricator.wikimedia.org/T317417 (10TheresNoTime) @MatthewVernon continuing from T316845, (and I know I'm pushing my luck he... [11:35:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P35395 and previous config saved to /var/cache/conftool/dbconfig/20221011-113501-ladsgroup.json [11:36:36] (03PS1) 10Ladsgroup: Add drop_fr_comment_fr_text_T318955.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/841475 (https://phabricator.wikimedia.org/T318955) [11:37:43] (03CR) 10Vlad.shapik: Update the logic to run code coverage (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016) (owner: 10Vlad.shapik) [11:37:57] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) [11:46:37] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10cmooney) Thanks for tracking all this John. As you know most of our hosts just have a single interface with single unica... [11:47:21] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [11:47:34] 10SRE-swift-storage, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-Phonos, 10Community-Tech (CommTech-Sprint-35), and 2 others: Phonos links to an unauthorized URL - https://phabricator.wikimedia.org/T317417 (10MatthewVernon) So that `render` is coming from the `zone` setting in your `rewrite.py` c... 
[11:47:38] (03PS1) 10Hnowlan: haproxy: fix apt repository path [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/841477 (https://phabricator.wikimedia.org/T233196) [11:48:55] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [11:49:32] (03PS1) 10Arturo Borrero Gonzalez: cloudnet: merge host hiera overrides back into the profile [puppet] - 10https://gerrit.wikimedia.org/r/841478 [11:49:58] (03PS1) 10Jbond: wmflib: add new functions to update a hash with randome secrets [puppet] - 10https://gerrit.wikimedia.org/r/841479 [11:50:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P35396 and previous config saved to /var/cache/conftool/dbconfig/20221011-115007-ladsgroup.json [11:50:45] (03PS11) 10Hnowlan: thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) [11:52:01] (03PS2) 10Jbond: wmflib: add new functions to update a hash with randome secrets [puppet] - 10https://gerrit.wikimedia.org/r/841479 [11:53:03] someone™ still needs to pull and sync this wmf.5 backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/841160 [11:53:12] I can do it later but not right now [11:53:22] if anyone else wants to do it in the meantime :) [11:54:37] (03CR) 10Hnowlan: thumbor: new service chart (0333 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [11:56:46] 10SRE, 10Infrastructure-Foundations, 10netops: Default allowed SSH parameters on upgraded Juniper mgmt routers prevent some connections - https://phabricator.wikimedia.org/T320272 (10cmooney) > AFAIK this configures the ssh daemon to accept connections using this protocol (possibly also configures outbound c... 
[12:00:42] (03CR) 10Hnowlan: [C: 03+2] admin: add thumbor namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/824473 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [12:01:56] (03PS2) 10Arturo Borrero Gonzalez: cloudnet: merge host hiera overrides back into the profile [puppet] - 10https://gerrit.wikimedia.org/r/841478 (https://phabricator.wikimedia.org/T316284) [12:02:42] 10SRE-swift-storage, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-Phonos, 10Community-Tech (CommTech-Sprint-35), and 2 others: Phonos links to an unauthorized URL - https://phabricator.wikimedia.org/T317417 (10MatthewVernon) I //think// `global-data-phonos-render` is likely the correct location (p... [12:02:53] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10jbond) >>! In T234207#8307389, @cmooney wrote: > I'm not sure if this task is the best place to discuss this but I'm of t... 
[12:05:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T314041)', diff saved to https://phabricator.wikimedia.org/P35397 and previous config saved to /var/cache/conftool/dbconfig/20221011-120514-ladsgroup.json [12:05:19] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [12:05:55] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/pcc-worker1003/37498/" [puppet] - 10https://gerrit.wikimedia.org/r/841478 (https://phabricator.wikimedia.org/T316284) (owner: 10Arturo Borrero Gonzalez) [12:08:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:09:34] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [12:09:54] (03PS12) 10Hnowlan: thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) [12:10:52] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10MoritzMuehlenhoff) >>! In T234207#8307389, @cmooney wrote: > Thanks for tracking all this John. > > So for instance we c... 
[12:13:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:13:08] (03CR) 10Hnowlan: admin: add thumbor namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/824473 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [12:13:56] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10ayounsi) [12:16:23] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:22:46] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10cmooney) >>! In T234207#8307423, @jbond wrote: > Perhaps from the netbox PoV but from any new (networkd) module should su... [12:30:23] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10jbond) >>! In T234207#8307431, @MoritzMuehlenhoff wrote: >>>! In T234207#8307389, @cmooney wrote: >> Thanks for tracking... 
[12:32:00] (03PS1) 10Hoo man: updateQueryServiceLag: Add lb(-pool) options for forward compatibility [extensions/Wikidata.org] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/841164 (https://phabricator.wikimedia.org/T315423) [12:32:30] (03PS1) 10Hoo man: updateQueryServiceLag: Add lb(-pool) options for forward compatibility [extensions/Wikidata.org] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841165 (https://phabricator.wikimedia.org/T315423) [12:32:44] (03PS1) 10Vgutierrez: trafficserver: Partition cache in one server per DC and cluster #4 [puppet] - 10https://gerrit.wikimedia.org/r/841486 (https://phabricator.wikimedia.org/T317748) [12:33:14] (03CR) 10Volans: [C: 03+1] "LGTM, optional improvement inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/841464 (owner: 10Muehlenhoff) [12:33:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [12:35:20] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/841478 (https://phabricator.wikimedia.org/T316284) (owner: 10Arturo Borrero Gonzalez) [12:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [12:38:55] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37499/console" [puppet] - 10https://gerrit.wikimedia.org/r/841486 (https://phabricator.wikimedia.org/T317748) (owner: 10Vgutierrez) [12:39:23] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host 
lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED [12:42:50] jouncebot: now [12:42:51] No deployments scheduled for the next 0 hour(s) and 17 minute(s) [12:42:55] (03PS1) 10Jgiannelos: mobileapps: Bump to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/841489 [12:43:13] alright, I would pull and sync the wmf.5 backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/841160 now [12:43:35] it shouldn’t have any effect yet, but my understanding is that, since wmf.5 already exists on the servers, this should also be synced after being merged into the branch [12:44:26] though I’m not 100% sure about that, because the “Branch commit for wmf/1.40.0-wmf.5” (https://gerrit.wikimedia.org/r/c/mediawiki/core/+/840575) apparently hasn’t been pulled+synced yet either [12:45:28] hm, actually, extensions/ and skins/ are empty apart from a README so far [12:45:44] so I’d have to do a sync-world, probably [12:46:01] I think I’ll leave that for the train deployers, then :) [12:46:35] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Partition cache in one server per DC and cluster #4 [puppet] - 10https://gerrit.wikimedia.org/r/841486 (https://phabricator.wikimedia.org/T317748) (owner: 10Vgutierrez) [12:46:53] !log partitioning the ATS cache in cp[2035-2036], cp[6004,6012], cp[1083-1084], cp[5005,5011], cp[3058-3059], cp[4025,4029] - T317748 [12:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:58] T317748: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748 [12:50:09] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277 (10SLyngshede-WMF) 05Open→03In progress [12:50:57] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Bump to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/841489 (owner: 10Jgiannelos) 
[12:51:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST jobs) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:52:05] (03PS6) 10Samtar: swift: Add deployment-prep_hosts.yaml [puppet] - 10https://gerrit.wikimedia.org/r/836953 (https://phabricator.wikimedia.org/T316845) [12:53:30] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10SLyngshede-WMF) a:03SLyngshede-WMF [12:53:59] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10SLyngshede-WMF) Note to myself: Check if this is still an issue, and if yes, are we still working on it. [12:55:00] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10puppet-compiler, and 2 others: Improve PCC support for cloud VPS environments - https://phabricator.wikimedia.org/T289666 (10SLyngshede-WMF) a:03SLyngshede-WMF [12:55:13] (03Merged) 10jenkins-bot: mobileapps: Bump to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/841489 (owner: 10Jgiannelos) [12:55:38] FYI: I will come a bit later for SWAT (but will do my patches on my own) [12:55:42] 10Puppet, 10Infrastructure-Foundations: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4] - https://phabricator.wikimedia.org/T290984 (10SLyngshede-WMF) a:03SLyngshede-WMF [12:56:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST jobs) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:57:20] (03PS1) 10Kosta Harlan: 
AddContributeCardEntryPoint: Use RequestContext::getMain [extensions/ContentTranslation] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841509 (https://phabricator.wikimedia.org/T319327) [12:57:39] 10SRE-swift-storage, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-Phonos, 10Community-Tech (CommTech-Sprint-35), and 2 others: Phonos links to an unauthorized URL - https://phabricator.wikimedia.org/T317417 (10TheresNoTime) With the utmost thanks to @MatthewVernon and everyone else who has comment... [12:57:43] (03PS2) 10Muehlenhoff: sre.ganeti.changedisk: Correct RAPI call [cookbooks] - 10https://gerrit.wikimedia.org/r/841464 [12:57:47] (03CR) 10Muehlenhoff: sre.ganeti.changedisk: Correct RAPI call (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/841464 (owner: 10Muehlenhoff) [12:58:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:58:40] 10Puppet, 10Infrastructure-Foundations: sslcert::x509_to_pkcs12 fails to overwrite a valid output file when its contents should change - https://phabricator.wikimedia.org/T287869 (10SLyngshede-WMF) a:03SLyngshede-WMF [12:58:51] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [12:59:20] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [12:59:22] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) 05In progress→03Resolved This should be resolved now ill tentativly close it, thanks for the ping and please re-open if there are sti... [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: My dear minions, it's time we take the moon! Just kidding. Time for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221011T1300). 
[13:00:05] stephanebisson and hoo: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221011T1300) [13:00:08] o/ [13:00:19] I can deploy [13:00:33] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [13:00:34] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10cmooney) > One thing i forgot to highlight is thet tere is currently a bit of a chicken/egg issue of using interface_auto... [13:00:56] Hello [13:01:34] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [13:01:34] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Make discovery mode config default to 'off' [extensions/Wikistories] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/840178 (https://phabricator.wikimedia.org/T314582) (owner: 10Sbisson) [13:01:39] hi stephanebisson [13:01:55] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/841464 (owner: 10Muehlenhoff) [13:01:56] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [13:02:25] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10puppet-compiler, and 2 others: Improve PCC support for cloud VPS environments - https://phabricator.wikimedia.org/T289666 (10jbond) > add an option to test on one random (or possibly hardcoded) host from both the cloud and wmcs environments This is sti... 
[13:02:35] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10puppet-compiler, and 2 others: Improve PCC support for cloud VPS environments - https://phabricator.wikimedia.org/T289666 (10jbond) [13:02:45] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [13:02:52] (03CR) 10Muehlenhoff: [C: 03+2] sre.ganeti.changedisk: Correct RAPI call [cookbooks] - 10https://gerrit.wikimedia.org/r/841464 (owner: 10Muehlenhoff) [13:03:43] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff) [13:03:54] (03Merged) 10jenkins-bot: Make discovery mode config default to 'off' [extensions/Wikistories] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/840178 (https://phabricator.wikimedia.org/T314582) (owner: 10Sbisson) [13:05:20] fetching [13:06:04] stephanebisson: the change should be on mwdebug1001, can you test it? [13:06:17] Lucas_WMDE, yes, on it [13:06:22] thanks [13:07:27] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:07:41] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:08:28] Lucas_WMDE Looks good [13:08:58] ok [13:10:03] (03PS1) 10JMeybohm: Randomize tokens in profile::kubernetes::infrastructure_users [labs/private] - 10https://gerrit.wikimedia.org/r/841494 [13:10:05] nemo-yiannis: (assuming you’re jgiannelos) are you looking into the mobileapps alert that icinga just posted above? 
[13:10:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:10:16] yeah, i am going to revert [13:10:21] ok, just checking :) [13:10:27] I’ll proceed with the backport then [13:10:42] * Lucas_WMDE forgot to use scap backport again [13:11:13] syncing [13:11:33] (03PS1) 10JMeybohm: kubernetes::master fail if user tokens are not unique [puppet] - 10https://gerrit.wikimedia.org/r/841495 [13:11:50] (03PS1) 10Jgiannelos: Revert "mobileapps: Bump to latest version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/841510 [13:12:08] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10Observability-Metrics, 10Patch-For-Review: Implement Prometheus exporter for Ganeti capacity data - https://phabricator.wikimedia.org/T311288 (10SLyngshede-WMF) 05In progress→03Resolved [13:12:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:12:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:12:47] (03CR) 10Jgiannelos: [C: 03+2] Revert "mobileapps: Bump to latest version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/841510 (owner: 10Jgiannelos) [13:13:45] !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED [13:13:55] (03CR) 10CI reject: [V: 04-1] kubernetes::master fail if user tokens are not unique [puppet] - 10https://gerrit.wikimedia.org/r/841495 (owner: 10JMeybohm) [13:14:14] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED [13:14:42] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277 (10jbond) > The original idea was that we don't want to ignore ack'ed alerts blindly Im not sure this was the original idea going from the ta... 
[13:14:58] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.40.0-wmf.4/extensions/Wikistories/extension.json: Backport: [[gerrit:840178|Make discovery mode config default to 'off' (T314582)]] (duration: 03m 48s) [13:15:03] T314582: Make Wikistories configurable for public release - https://phabricator.wikimedia.org/T314582 [13:15:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:15:29] ok, that’s done [13:15:39] hoo will self-service later [13:16:00] Lucas_WMDE thank you! [13:16:04] (03PS2) 10JMeybohm: kubernetes::master fail if user tokens are not unique [puppet] - 10https://gerrit.wikimedia.org/r/841495 [13:16:06] np :) [13:16:37] (03Merged) 10jenkins-bot: Revert "mobileapps: Bump to latest version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/841510 (owner: 10Jgiannelos) [13:17:13] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [13:17:29] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [13:17:34] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [13:18:15] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [13:18:23] (03CR) 10jenkins-bot: kubernetes::master fail if user tokens are not unique [puppet] - 10https://gerrit.wikimedia.org/r/841495 (owner: 10JMeybohm) [13:18:35] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [13:18:45] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:19:16] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [13:19:39] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10Jclark-ctr) Verified Netbox Thanks 
[13:19:47] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10Jclark-ctr) 05Open→03Resolved [13:19:55] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations, 10netops, and 2 others: asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10Jclark-ctr) [13:20:40] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/841495 (owner: 10JMeybohm) [13:20:41] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:21:36] Ok it looks like mobileapps is not complaining any more, i will push a fix and re-deploy [13:21:48] (03PS3) 10Ssingh: P:base: configure Linux 5.10 on buster via Hiera [puppet] - 10https://gerrit.wikimedia.org/r/840162 (https://phabricator.wikimedia.org/T319067) [13:22:48] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37500/console" [puppet] - 10https://gerrit.wikimedia.org/r/840162 (https://phabricator.wikimedia.org/T319067) (owner: 10Ssingh) [13:23:18] (03PS6) 10Giuseppe Lavagetto: Stub of the new organization of templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495 [13:23:30] (03PS3) 10Jbond: kubernetes::master fail if user tokens are not unique [puppet] - 10https://gerrit.wikimedia.org/r/841495 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [13:23:40] (03CR) 10CI reject: [V: 04-1] Stub of the new organization of templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495 (owner: 10Giuseppe Lavagetto) [13:25:31] (03PS4) 10JMeybohm: kubernetes::master fail if user tokens are not unique [puppet] - 10https://gerrit.wikimedia.org/r/841495 (https://phabricator.wikimedia.org/T307943) [13:26:56] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Randomize tokens in 
profile::kubernetes::infrastructure_users [labs/private] - 10https://gerrit.wikimedia.org/r/841494 (owner: 10JMeybohm) [13:30:41] (03PS1) 10Ssingh: dns4004: add Puppet role and DNS/NTP configs [puppet] - 10https://gerrit.wikimedia.org/r/841496 (https://phabricator.wikimedia.org/T317247) [13:33:27] (03PS1) 10BBlack: Temporary rate exemption for IABot source IPs [puppet] - 10https://gerrit.wikimedia.org/r/841499 (https://phabricator.wikimedia.org/T318065) [13:33:43] 10Puppet, 10Infrastructure-Foundations: sslcert::x509_to_pkcs12 fails to overwrite a valid output file when its contents should change - https://phabricator.wikimedia.org/T287869 (10jbond) im not sure if i ever looked at this task, however i do notice that i have an old close PR for stdlib which seems related... [13:34:34] (03CR) 10Hnowlan: [C: 03+2] Update the logic to run code coverage [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016) (owner: 10Vlad.shapik) [13:35:15] (03PS1) 10Ssingh: hiera: decommission dns4002 [puppet] - 10https://gerrit.wikimedia.org/r/841500 (https://phabricator.wikimedia.org/T320440) [13:37:11] (03CR) 10Ssingh: [C: 03+2] hiera: decommission dns4002 [puppet] - 10https://gerrit.wikimedia.org/r/841500 (https://phabricator.wikimedia.org/T320440) (owner: 10Ssingh) [13:37:21] (03PS3) 10BBlack: Add wikifunctions to Varnish as a 302 [puppet] - 10https://gerrit.wikimedia.org/r/822657 (https://phabricator.wikimedia.org/T275904) [13:37:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hoo@deploy1002 using scap backport" [extensions/Wikidata.org] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/841164 (https://phabricator.wikimedia.org/T315423) (owner: 10Hoo man) [13:40:41] (03PS3) 10Ayounsi: Management routers: replace bootp with dhcp-relay [homer/public] - 10https://gerrit.wikimedia.org/r/841460 (https://phabricator.wikimedia.org/T271583) [13:41:07] (03PS1) 10Ssingh: sites.yaml: decom dns4002 [homer/public] - 
10https://gerrit.wikimedia.org/r/841501 (https://phabricator.wikimedia.org/T320440) [13:41:33] (03CR) 10BBlack: [C: 03+2] Temporary rate exemption for IABot source IPs [puppet] - 10https://gerrit.wikimedia.org/r/841499 (https://phabricator.wikimedia.org/T318065) (owner: 10BBlack) [13:41:48] (03PS1) 10Jgiannelos: mobileapps: Move hardcoded values to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/841502 [13:42:57] (03CR) 10Jgiannelos: "This is related to the production errors triggered from this deployment:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/841502 (owner: 10Jgiannelos) [13:43:05] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good! I'll rebase https://gerrit.wikimedia.org/r/c/operations/puppet/+/841134 after you've merged" [puppet] - 10https://gerrit.wikimedia.org/r/840162 (https://phabricator.wikimedia.org/T319067) (owner: 10Ssingh) [13:44:03] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:base: configure Linux 5.10 on buster via Hiera [puppet] - 10https://gerrit.wikimedia.org/r/840162 (https://phabricator.wikimedia.org/T319067) (owner: 10Ssingh) [13:44:18] (03CR) 10Jgiannelos: "This is related to this mobileapps change:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/841502 (owner: 10Jgiannelos) [13:44:20] (03CR) 10BBlack: [C: 03+2] Add wikifunctions to Varnish as a 302 [puppet] - 10https://gerrit.wikimedia.org/r/822657 (https://phabricator.wikimedia.org/T275904) (owner: 10BBlack) [13:44:29] (03PS4) 10BBlack: Add wikifunctions to Varnish as a 302 [puppet] - 10https://gerrit.wikimedia.org/r/822657 (https://phabricator.wikimedia.org/T275904) [13:47:23] (03Merged) 10jenkins-bot: Update the logic to run code coverage [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016) (owner: 10Vlad.shapik) [13:49:14] (03PS2) 10Ssingh: Switch profile::base::linux510 to the new meta package [puppet] - 10https://gerrit.wikimedia.org/r/841134 (https://phabricator.wikimedia.org/T319067) (owner: 
10Muehlenhoff) [13:49:16] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Manuel - https://phabricator.wikimedia.org/T320504 (10Manuel) [13:50:14] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED [13:50:47] RECOVERY - SSH on restbase1028 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:50:55] PROBLEM - cassandra-c service on restbase1028 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:51:27] PROBLEM - cassandra-b service on restbase1028 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:51:33] PROBLEM - cassandra-a SSL 10.64.0.209:7001 on restbase1028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [13:51:33] PROBLEM - cassandra-b SSL 10.64.0.210:7001 on restbase1028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [13:51:33] PROBLEM - cassandra-c SSL 10.64.0.211:7001 on restbase1028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [13:51:39] PROBLEM - cassandra-a service on restbase1028 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:51:55] RECOVERY - Restbase root url on restbase1028 is OK: HTTP OK: HTTP/1.1 200 - 17317 bytes in 0.018 second response time https://wikitech.wikimedia.org/wiki/RESTBase [13:52:13] PROBLEM - puppet last run on 
restbase1028 is CRITICAL: CRITICAL: Puppet last ran 2 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:52:18] (03PS1) 10Elukey: istio: reduce Envoy logspam [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/841527 (https://phabricator.wikimedia.org/T320468) [13:53:00] (03CR) 10Volans: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/841460 (https://phabricator.wikimedia.org/T271583) (owner: 10Ayounsi) [13:53:11] RECOVERY - cassandra-c service on restbase1028 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:53:41] RECOVERY - cassandra-b service on restbase1028 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:53:53] RECOVERY - cassandra-a service on restbase1028 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:54:00] (03PS2) 10Jgiannelos: mobileapps: Move hardcoded values to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/841502 [13:54:05] (03PS2) 10Elukey: istio: reduce Envoy logspam [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/841527 (https://phabricator.wikimedia.org/T320468) [13:54:45] (03PS3) 10Jgiannelos: mobileapps: Move hardcoded values to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/841502 (https://phabricator.wikimedia.org/T320505) [13:55:09] RECOVERY - cassandra-a CQL 10.64.0.209:9042 on restbase1028 is OK: TCP OK - 0.000 second response time on 10.64.0.209 port 9042 https://phabricator.wikimedia.org/T93886 [13:56:01] RECOVERY - cassandra-a SSL 10.64.0.209:7001 on restbase1028 is OK: SSL OK - Certificate restbase1028-a valid until 2024-08-30 21:25:17 +0000 (expires in 689 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [13:56:01] RECOVERY - cassandra-b SSL 10.64.0.210:7001 on restbase1028 is OK: SSL OK - Certificate 
restbase1028-b valid until 2024-08-30 21:25:20 +0000 (expires in 689 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [13:56:01] RECOVERY - cassandra-c SSL 10.64.0.211:7001 on restbase1028 is OK: SSL OK - Certificate restbase1028-c valid until 2024-08-30 21:25:22 +0000 (expires in 689 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [13:56:15] RECOVERY - cassandra-b CQL 10.64.0.210:9042 on restbase1028 is OK: TCP OK - 0.000 second response time on 10.64.0.210 port 9042 https://phabricator.wikimedia.org/T93886 [13:56:33] (03Merged) 10jenkins-bot: updateQueryServiceLag: Add lb(-pool) options for forward compatibility [extensions/Wikidata.org] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/841164 (https://phabricator.wikimedia.org/T315423) (owner: 10Hoo man) [13:56:37] RECOVERY - cassandra-c CQL 10.64.0.211:9042 on restbase1028 is OK: TCP OK - 0.000 second response time on 10.64.0.211 port 9042 https://phabricator.wikimedia.org/T93886 [13:56:47] !log hoo@deploy1002 Started scap: Backport for [[gerrit:841164|updateQueryServiceLag: Add lb(-pool) options for forward compatibility (T315423 T238751)]] [13:56:53] T238751: Only generate maxlag from pooled query service servers. 
- https://phabricator.wikimedia.org/T238751 [13:56:53] T315423: Revive and merge patch to update maxlag calculation - https://phabricator.wikimedia.org/T315423 [13:57:07] !log hoo@deploy1002 hoo and hoo: Backport for [[gerrit:841164|updateQueryServiceLag: Add lb(-pool) options for forward compatibility (T315423 T238751)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:58:12] (03PS7) 10Giuseppe Lavagetto: Stub of the new organization of templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495 [13:58:27] RECOVERY - puppet last run on restbase1028 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:58:35] (03CR) 10CI reject: [V: 04-1] Stub of the new organization of templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495 (owner: 10Giuseppe Lavagetto) [13:59:29] 10SRE, 10ops-ulsfo, 10Traffic, 10decommission-hardware, 10Patch-For-Review: decommission dns4002 - https://phabricator.wikimedia.org/T320440 (10ssingh) [14:00:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:01:45] !log hoo@deploy1002 Finished scap: Backport for [[gerrit:841164|updateQueryServiceLag: Add lb(-pool) options for forward compatibility (T315423 T238751)]] (duration: 04m 57s) [14:02:38] (03CR) 10Hoo man: [C: 03+2] "Branch is not deployed yet" [extensions/Wikidata.org] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841165 (https://phabricator.wikimedia.org/T315423) (owner: 10Hoo man) [14:02:48] (03CR) 10Andrew Bogott: [C: 03+2] openstack: keystone: enable app credentials everywhere [puppet] - 10https://gerrit.wikimedia.org/r/840121 (https://phabricator.wikimedia.org/T294195) (owner: 10Majavah) [14:03:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:03:18] !log mwdebug-deploy@deploy1002 
helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:05:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:06:20] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/841460 (https://phabricator.wikimedia.org/T271583) (owner: 10Ayounsi) [14:08:36] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Bump the chart version too, otherwise this isn't going to be deployable." [deployment-charts] - 10https://gerrit.wikimedia.org/r/841502 (https://phabricator.wikimedia.org/T320505) (owner: 10Jgiannelos) [14:10:13] (03CR) 10Jbond: [C: 03+1] sites.yaml: decom dns4002 [homer/public] - 10https://gerrit.wikimedia.org/r/841501 (https://phabricator.wikimedia.org/T320440) (owner: 10Ssingh) [14:10:44] (03CR) 10Ayounsi: [C: 03+2] Management routers: replace bootp with dhcp-relay [homer/public] - 10https://gerrit.wikimedia.org/r/841460 (https://phabricator.wikimedia.org/T271583) (owner: 10Ayounsi) [14:11:01] (03PS3) 10Andrew Bogott: P:mariadb::cloudinfra: remove direct access from puppetmaster hosts [puppet] - 10https://gerrit.wikimedia.org/r/831041 (owner: 10Majavah) [14:11:36] (03Merged) 10jenkins-bot: Management routers: replace bootp with dhcp-relay [homer/public] - 10https://gerrit.wikimedia.org/r/841460 (https://phabricator.wikimedia.org/T271583) (owner: 10Ayounsi) [14:12:39] (03CR) 10Ssingh: [C: 03+2] sites.yaml: decom dns4002 [homer/public] - 10https://gerrit.wikimedia.org/r/841501 (https://phabricator.wikimedia.org/T320440) (owner: 10Ssingh) [14:14:25] !log homer "cr*-ulsfo*" commit "Gerrit 841501: sites.yaml: decom dns4002" [14:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:18] !log completed homer run for Gerrit 841501 [14:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:23] RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 11 AdminDown: 0 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:16:21] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 82, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:16:23] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 105, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:16:31] ^ stems from Gerrit 841501 [14:17:13] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:17:16] (03PS2) 10Muehlenhoff: Make ganeti4008 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/841390 (https://phabricator.wikimedia.org/T317247) [14:19:29] (03Merged) 10jenkins-bot: updateQueryServiceLag: Add lb(-pool) options for forward compatibility [extensions/Wikidata.org] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841165 (https://phabricator.wikimedia.org/T315423) (owner: 10Hoo man) [14:19:59] (03CR) 10Muehlenhoff: [C: 03+2] Make ganeti4008 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/841390 (https://phabricator.wikimedia.org/T317247) (owner: 10Muehlenhoff) [14:20:48] (03PS1) 10Ssingh: sites.yaml: add dns4004 to anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/841533 (https://phabricator.wikimedia.org/T317247) [14:22:18] (03PS9) 10Andrew Bogott: alerts.downtime_host: attempt to match alert hostnames with : [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 [14:22:20] (03CR) 10Andrew Bogott: [C: 03+2] P:mariadb::cloudinfra: remove direct access from puppetmaster hosts [puppet] - 10https://gerrit.wikimedia.org/r/831041 (owner: 10Majavah) [14:22:45] (03PS2) 10Ssingh: dns4004: add Puppet role and DNS/NTP configs [puppet] - 10https://gerrit.wikimedia.org/r/841496 (https://phabricator.wikimedia.org/T317247) [14:23:18] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet 
manages network interfaces - https://phabricator.wikimedia.org/T234207 (10ayounsi) For physical servers we indeed need to keep the whole lifecycle/provisioning process in mind (racking/provisioni... [14:24:57] (03CR) 10CI reject: [V: 04-1] mobileapps: Move hardcoded values to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/841502 (https://phabricator.wikimedia.org/T320505) (owner: 10Jgiannelos) [14:25:20] (03PS1) 10Muehlenhoff: Switch profile::base::linux510 to the new meta package [puppet] - 10https://gerrit.wikimedia.org/r/841536 (https://phabricator.wikimedia.org/T319067) [14:25:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:26:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet [14:26:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:26:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:26:41] (03CR) 10Ayounsi: [C: 03+1] sites.yaml: add dns4004 to anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/841533 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [14:27:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:29:10] 10SRE, 10Discovery-Search, 10serviceops, 10serviceops-collab, and 2 others: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10LSobanski) a:03Dzahn [14:30:21] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for hshaikh and ptiwary - https://phabricator.wikimedia.org/T319326 (10HShaikh) Just saw that on the patch it was mentioned that the shell access might be needed. To give more context I am not sure if the data... 
[14:30:25] 10SRE, 10Observability-Logging, 10Observability-Metrics, 10serviceops, 10Performance-Team (Radar): Framework for running experiments on a subset of the app server fleet - https://phabricator.wikimedia.org/T315403 (10LSobanski) [14:30:35] (03CR) 10Ssingh: [C: 03+1] Switch profile::base::linux510 to the new meta package [puppet] - 10https://gerrit.wikimedia.org/r/841536 (https://phabricator.wikimedia.org/T319067) (owner: 10Muehlenhoff) [14:30:56] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10cmooney) > Which means being able to map the real world interface to the logical one, from previous conversations it's o... [14:31:20] (03CR) 10Muehlenhoff: [C: 03+2] Switch profile::base::linux510 to the new meta package [puppet] - 10https://gerrit.wikimedia.org/r/841536 (https://phabricator.wikimedia.org/T319067) (owner: 10Muehlenhoff) [14:33:22] (03PS1) 10JMeybohm: Keep k8s tokens identifiable as dummys [labs/private] - 10https://gerrit.wikimedia.org/r/841537 [14:35:38] (03PS1) 10Muehlenhoff: Use profile::base::use_linux510_on_buster for cloudmetrics [puppet] - 10https://gerrit.wikimedia.org/r/841538 (https://phabricator.wikimedia.org/T297814) [14:39:34] (03PS2) 10JMeybohm: Keep k8s tokens identifiable as dummys [labs/private] - 10https://gerrit.wikimedia.org/r/841537 [14:39:39] (03CR) 10Jbond: [C: 03+2] casLoginView.html: drop card properties [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/841181 (owner: 10Jbond) [14:39:46] (03CR) 10Jbond: [V: 03+2 C: 03+2] build.gradle: add oidc support [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/841456 (owner: 10Jbond) [14:39:50] (03CR) 10Jbond: [V: 03+2 C: 03+2] casLoginView.html: drop card properties [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/841181 (owner: 10Jbond) [14:41:00] (03CR) 10Jbond: [C: 03+1] Keep k8s tokens 
identifiable as dummys [labs/private] - 10https://gerrit.wikimedia.org/r/841537 (owner: 10JMeybohm) [14:42:02] (03PS3) 10JMeybohm: Keep k8s tokens identifiable as dummys [labs/private] - 10https://gerrit.wikimedia.org/r/841537 [14:42:57] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Keep k8s tokens identifiable as dummys [labs/private] - 10https://gerrit.wikimedia.org/r/841537 (owner: 10JMeybohm) [14:46:31] PROBLEM - Check systemd state on ganeti4008 is CRITICAL: CRITICAL - degraded: The following units failed: nic-saturation-exporter.service,prometheus-ganeti-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:47:35] (03PS1) 10Filippo Giunchedi: pontoon: latest netbox hiera fixes [puppet] - 10https://gerrit.wikimedia.org/r/841541 [14:47:37] (03PS1) 10Filippo Giunchedi: prometheus: probe mgmt network from netmon host [puppet] - 10https://gerrit.wikimedia.org/r/841542 (https://phabricator.wikimedia.org/T169860) [14:48:11] RECOVERY - Check systemd state on ganeti4008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:49:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet [14:49:50] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: latest netbox hiera fixes [puppet] - 10https://gerrit.wikimedia.org/r/841541 (owner: 10Filippo Giunchedi) [14:50:43] !log disable cr1-eqiad<->asw2-c-eqiad link for optic replacement [14:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4008.ulsfo.wmnet to cluster ulsfo and group 1 [14:51:22] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4008.ulsfo.wmnet to cluster ulsfo and group 1 [14:52:42] (03PS4) 10Jgiannelos: mobileapps: Move hardcoded values to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/841502 
(https://phabricator.wikimedia.org/T320505) [14:53:35] (03CR) 10Elukey: "For everybody's context, from the kube-api's help:" [puppet] - 10https://gerrit.wikimedia.org/r/841463 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [14:54:38] jouncebot nowandnext [14:54:38] No deployments scheduled for the next 1 hour(s) and 5 minute(s) [14:54:39] In 1 hour(s) and 5 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221011T1600) [14:56:40] !log re-enable cr1-eqiad<->asw2-c-eqiad link after optic replacement [14:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:07] (03CR) 10Elukey: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/841463 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [14:58:45] (03CR) 10Elukey: [C: 03+1] "LGTM (modulo pcc running and showing no failures)" [puppet] - 10https://gerrit.wikimedia.org/r/841495 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [14:58:56] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:00:26] (03PS3) 10Ssingh: dns4004: add Puppet role and DNS/NTP configs [puppet] - 10https://gerrit.wikimedia.org/r/841496 (https://phabricator.wikimedia.org/T317247) [15:00:28] (03PS2) 10Filippo Giunchedi: prometheus: probe mgmt network from netmon host [puppet] - 10https://gerrit.wikimedia.org/r/841542 (https://phabricator.wikimedia.org/T169860) [15:01:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:01:53] (03PS3) 10Elukey: istio: reduce Envoy logspam [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/841527 (https://phabricator.wikimedia.org/T320468) [15:02:13] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [15:02:48] (03CR) 10Samtar: [C: 03+1] "Cherry-picked to beta cluster, appears to be working well" [puppet] - 
10https://gerrit.wikimedia.org/r/828664 (https://phabricator.wikimedia.org/T316845) (owner: 10Zabe) [15:03:40] (03CR) 10Samtar: "Also cherry-picked to beta, would appreciate a second set of eyes on the config but I think it's reasonable finally 😊" [puppet] - 10https://gerrit.wikimedia.org/r/836953 (https://phabricator.wikimedia.org/T316845) (owner: 10Samtar) [15:03:59] (03CR) 10Ssingh: [C: 03+2] dns4004: add Puppet role and DNS/NTP configs [puppet] - 10https://gerrit.wikimedia.org/r/841496 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [15:04:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4008.ulsfo.wmnet to cluster ulsfo and group 1 [15:04:21] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:05:03] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns4004.wikimedia.org with OS buster [15:05:11] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns4004.wikimedia.org with OS buster [15:05:14] (03CR) 10Samtar: [C: 03+1] "Cherry-picked to beta cluster, appears to be working well" [puppet] - 10https://gerrit.wikimedia.org/r/831955 (https://phabricator.wikimedia.org/T317417) (owner: 10MusikAnimal) [15:07:17] 10SRE-swift-storage, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-Phonos, 10Community-Tech (CommTech-Sprint-35), and 2 others: Phonos links to an unauthorized URL - https://phabricator.wikimedia.org/T317417 (10TheresNoTime) Acceptance criteria, issue not present — see also T314294 [15:07:41] 10SRE, 10ops-eqiad, 10Data-Engineering: Check analytics1086 mgmt's cable - https://phabricator.wikimedia.org/T320458 (10BTullis) Just for good measure, I have carried out a cold reset of the IPMI controller with: ` 
btullis@an-worker1086:~$ sudo bmc-device --cold-reset; echo $? 0 ` I'll check again to see whe... [15:07:45] 10SRE, 10ops-ulsfo, 10Traffic, 10decommission-hardware: decommission dns4002 - https://phabricator.wikimedia.org/T320440 (10ssingh) @RobH: I think we can mark this as resolved as all the Puppet configuration has been removed and you already ran the decom cookbook. Deferring this to you in case something el... [15:08:47] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T314998 (10Jclark-ctr) Replaced Optic in asw2-c2 port 53. cleaned fiber both ends [15:08:59] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T314998 (10Jclark-ctr) a:05Cmjohnson→03Jclark-ctr [15:09:10] !log disable cr1-eqiad<->asw2-d-eqiad link for re-cabling - T313463 [15:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:14] T313463: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 [15:10:28] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with reboot policy FORCED [15:11:23] 10SRE, 10ops-ulsfo, 10Traffic, 10decommission-hardware: decommission dns4002 - https://phabricator.wikimedia.org/T320440 (10RobH) 05Open→03Resolved [15:11:27] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10RobH) [15:14:40] (03CR) 10MVernon: [C: 03+1] "This looks good to me, thanks for your patience with getting this all working!" 
[puppet] - 10https://gerrit.wikimedia.org/r/836953 (https://phabricator.wikimedia.org/T316845) (owner: 10Samtar) [15:14:41] 10SRE, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Export confd template status as Prometheus metrics - https://phabricator.wikimedia.org/T319272 (10fgiunchedi) 05Open→03Resolved [15:14:44] 10SRE, 10serviceops-radar, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10fgiunchedi) [15:16:01] PROBLEM - VRRP status on cr1-eqiad is CRITICAL: VRRP CRITICAL - 3 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [15:16:23] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with reboot policy FORCED [15:16:28] (03CR) 10MVernon: [C: 03+1] "This looks correct to me, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/828664 (https://phabricator.wikimedia.org/T316845) (owner: 10Zabe) [15:17:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with reboot policy FORCED [15:19:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:21:09] (03CR) 10Alexandros Kosiaris: [C: 03+1] mobileapps: Move hardcoded values to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/841502 (https://phabricator.wikimedia.org/T320505) (owner: 10Jgiannelos) [15:21:21] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for hshaikh and ptiwary - https://phabricator.wikimedia.org/T319326 (10Ottomata) > is available in superset It is available in superset, however because it is not a 'small' dataset, it is possible that it might... 
[15:21:43] (03CR) 10Ladsgroup: [C: 03+2] Add drop_fr_comment_fr_text_T318955.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/841475 (https://phabricator.wikimedia.org/T318955) (owner: 10Ladsgroup) [15:22:09] (03Merged) 10jenkins-bot: Add drop_fr_comment_fr_text_T318955.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/841475 (https://phabricator.wikimedia.org/T318955) (owner: 10Ladsgroup) [15:23:06] (03PS1) 10Samtar: InitialiseSettings-labs: Enable Phonos on en_rtlwiki, enwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841547 (https://phabricator.wikimedia.org/T314294) [15:23:08] (03PS22) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) [15:23:32] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns4004.wikimedia.org with reason: host reimage [15:24:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:25:21] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4008.ulsfo.wmnet to cluster ulsfo and group 1 [15:25:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2101.codfw.wmnet with reason: Maintenance [15:26:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2101.codfw.wmnet with reason: Maintenance [15:26:22] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns4004.wikimedia.org with reason: host reimage [15:27:07] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Move hardcoded values to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/841502 (https://phabricator.wikimedia.org/T320505) (owner: 10Jgiannelos) [15:27:21] Hi, I'm going to 
deploy a beta cluster only change ( https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/841547/ ) in a moment — any reasons not to? :) [15:27:52] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10MoritzMuehlenhoff) I have setup ganeti4008 as a node in the ulsfo Ganeti cluster and moved a VM to it to confirm it works as expected. [15:27:58] None that I'm aware of [15:29:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841547 (https://phabricator.wikimedia.org/T314294) (owner: 10Samtar) [15:29:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:30:00] (03Merged) 10jenkins-bot: InitialiseSettings-labs: Enable Phonos on en_rtlwiki, enwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841547 (https://phabricator.wikimedia.org/T314294) (owner: 10Samtar) [15:30:59] !log deployed beta cluster only change, [[gerrit:841547]], for T314294 [15:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:04] T314294: Deploy Phonos to beta cluster - https://phabricator.wikimedia.org/T314294 [15:31:27] (03Merged) 10jenkins-bot: mobileapps: Move hardcoded values to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/841502 (https://phabricator.wikimedia.org/T320505) (owner: 10Jgiannelos) [15:32:45] (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:32:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mwdebug: apply [15:33:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:33:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:34:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:35:33] !log sudo gnt-node migrate -f ganeti4001.ulsfo.wmnet [15:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:45] (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:38:32] !log sudo gnt-node evacuate -s ganeti4001.ulsfo.wmnet [15:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:57] (03Abandoned) 10Dzahn: mediawiki::api: fix kernel parameter name ip_local_port_range [puppet] - 10https://gerrit.wikimedia.org/r/831629 (https://phabricator.wikimedia.org/T317454) (owner: 10Dzahn) [15:41:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:42:45] (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:44:10] !log remove materialized .json files from schemas/event/secondary - this should be a no-op as no clients should actually be using the json files. 
- T315674 [15:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:15] T315674: Remove materialized .json files from event schema repositories - https://phabricator.wikimedia.org/T315674 [15:44:42] (03PS8) 10Giuseppe Lavagetto: New organization of templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495 [15:45:19] (03CR) 10CI reject: [V: 04-1] New organization of templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495 (owner: 10Giuseppe Lavagetto) [15:46:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:48:46] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Manuel - https://phabricator.wikimedia.org/T320504 (10Lea_WMDE) As the team lead of @Manuel I approve! 
[15:48:50] (03PS1) 10TrainBranchBot: testwikis wikis to 1.40.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841548 (https://phabricator.wikimedia.org/T314194) [15:48:52] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.40.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841548 (https://phabricator.wikimedia.org/T314194) (owner: 10TrainBranchBot) [15:49:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2111.codfw.wmnet with reason: Maintenance [15:49:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2111.codfw.wmnet with reason: Maintenance [15:49:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T318955)', diff saved to https://phabricator.wikimedia.org/P35401 and previous config saved to /var/cache/conftool/dbconfig/20221011-154934-ladsgroup.json [15:49:38] (03Merged) 10jenkins-bot: testwikis wikis to 1.40.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841548 (https://phabricator.wikimedia.org/T314194) (owner: 10TrainBranchBot) [15:49:39] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [15:49:42] 10SRE, 10InternetArchiveBot, 10Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (10Cyberpower678) 05Open→03Resolved a:03Cyberpower678 [15:49:54] 10SRE, 10InternetArchiveBot: Request for increase request limit for InternetArchiveBot - https://phabricator.wikimedia.org/T318284 (10Cyberpower678) 05Open→03Resolved a:03Cyberpower678 [15:49:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:50:02] !log dduvall@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.5 refs T314194 [15:50:03] (03CR) 10Elukey: [C: 04-1] "Looks like the 1.9.5-patch branch got deleted by upstream, they only offer 1.9.8-patch now.." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/841527 (https://phabricator.wikimedia.org/T320468) (owner: 10Elukey) [15:50:06] T314194: 1.40.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T314194 [15:50:16] (03PS1) 10Filippo Giunchedi: sre: issue confd per-template alerts [alerts] - 10https://gerrit.wikimedia.org/r/841549 (https://phabricator.wikimedia.org/T314118) [15:50:35] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host dns4004.wikimedia.org with OS buster [15:50:41] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns4004.wikimedia.org with OS buster completed: - dns4004 (... [15:50:45] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns4004.wikimedia.org with OS buster executed with errors:... [15:50:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:50:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:51:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:53:22] (03CR) 10Dzahn: [C: 03+2] "I feel the outcome supported my point about merging things this way when we only find out later at reload if things worked."
[puppet] - 10https://gerrit.wikimedia.org/r/839694 (owner: 10Hashar) [15:55:56] (03CR) 10JMeybohm: kubernetes::master fail if user tokens are not unique (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/841495 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [15:55:58] (03PS10) 10Hashar: Send events to Wikimedia EventGate [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814807 [15:56:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:56:56] (03PS8) 10Btullis: Add a new production images for spark and spark-operator [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) [15:57:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:57:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:57:42] (03CR) 10Hashar: "Squashed in https://gerrit.wikimedia.org/r/c/operations/software/gerrit/plugins/events-wikimedia/+/814807/10 ;)" [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/816115 (owner: 10Hashar) [15:58:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:59:52] (03CR) 10Btullis: Add a new production images for spark and spark-operator (036 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [16:00:05] jbond and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221011T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:46] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with reboot policy FORCED [16:00:59] (03Abandoned) 10Btullis: Add a spark-operator production image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [16:02:56] (03CR) 10Btullis: Add a spark-operator production image (034 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [16:05:10] (03PS1) 10Eigyan: Undeploy the GDI wave 3 survey from PROD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841551 (https://phabricator.wikimedia.org/T320495) [16:05:42] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T314998 (10Jclark-ctr) 05Open→03Resolved [16:09:34] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [16:14:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T318955)', diff saved to https://phabricator.wikimedia.org/P35402 and previous config saved to /var/cache/conftool/dbconfig/20221011-161414-ladsgroup.json [16:14:20] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [16:14:49] (03PS1) 10Matthias Mullie: Rescale images based on width alone [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841515 (https://phabricator.wikimedia.org/T320406) [16:16:02] !log depool elastic2052. 
failing to join cluster due to `PROBLEM - MD RAID on elastic2052 is CRITICAL: CRITICAL: State: degraded, Active: 5, Working: 5, Failed: 1, Spare: 0` [16:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:58] (03PS1) 10Ssingh: hiera: decom ganeti4001 [puppet] - 10https://gerrit.wikimedia.org/r/841553 (https://phabricator.wikimedia.org/T317249) [16:21:06] (03PS1) 10Ahmon Dancy: P:gitlab::runner: Do not quote the value of environment variables [puppet] - 10https://gerrit.wikimedia.org/r/841554 (https://phabricator.wikimedia.org/T317997) [16:21:45] 10SRE, 10Data Engineering Planning, 10Data Pipelines, 10Shared-Data-Infrastructure: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (10CDanis) Happy quarterly planning season; I was wondering if there was any updated estimates on when this m... [16:23:19] !log volans@cumin2002 conftool action : set/pooled=no; selector: name=elastic2052..* [16:23:57] !log dduvall@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.5 refs T314194 (duration: 33m 55s) [16:24:02] T314194: 1.40.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T314194 [16:26:00] !log dduvall@deploy1002 Pruned MediaWiki: 1.40.0-wmf.3 (duration: 02m 00s) [16:29:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P35403 and previous config saved to /var/cache/conftool/dbconfig/20221011-162920-ladsgroup.json [16:32:13] jbond/rzl: I have a puppet patch [16:32:31] dancy: hey, happy to deploy [16:32:42] Thanks! 
It is https://gerrit.wikimedia.org/r/c/operations/puppet/+/841554 [16:33:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [16:33:25] (03CR) 10Btullis: Add a new production images for spark and spark-operator (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [16:34:41] (03CR) 10RLazarus: [C: 03+2] P:gitlab::runner: Do not quote the value of environment variables [puppet] - 10https://gerrit.wikimedia.org/r/841554 (https://phabricator.wikimedia.org/T317997) (owner: 10Ahmon Dancy) [16:34:43] (03CR) 10Ahmon Dancy: "Pcc results: https://puppet-compiler.wmflabs.org/pcc-worker1002/37501/gitlab-runner1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/841554 (https://phabricator.wikimedia.org/T317997) (owner: 10Ahmon Dancy) [16:34:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:35:20] dancy: merging, will you want a manual puppet run anywhere? [16:35:30] Yes please. On all gitlab-runner* hosts. 
[16:35:46] can do [16:36:54] done on 1002, running the others in parallel [16:37:49] and done [16:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [16:39:14] rzl: Thanks. Looks like we need to configure puppet to restart the buildkitd service if that file changes. In the meantime can you restart buildkitd on those same targets? [16:39:23] ah sure [16:39:55] just a "systemctl restart buildkitd", yeah? [16:39:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:40:01] yeah [16:40:57] !log gitlab-runner[1002-1004,2002-2004] - systemctl restart buildkitd - T317997 [16:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:02] T317997: Support http_proxy, https_proxy and other proxy `build-arg:` options in blubber buildkit frontend - https://phabricator.wikimedia.org/T317997 [16:41:14] done [16:41:23] Thanks! 
Running a test build now [16:42:20] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Keystone: Expose the Keystone public API [puppet] - 10https://gerrit.wikimedia.org/r/838903 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott) [16:42:25] (03PS3) 10Andrew Bogott: Openstack Keystone: Expose the Keystone public API [puppet] - 10https://gerrit.wikimedia.org/r/838903 (https://phabricator.wikimedia.org/T319312) [16:43:01] PROBLEM - IPMI Sensor Status on es2021 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:43:12] rzl: Works! Thanks for getting us unstuck! [16:43:15] \o/ [16:44:27] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:44:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P35404 and previous config saved to /var/cache/conftool/dbconfig/20221011-164427-ladsgroup.json [16:46:09] * topranks looking at above BGP status [16:47:04] 10SRE, 10ops-codfw, 10Discovery-Search: Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 (10bking) a:03RKemper [16:47:51] Above BGP alert was doh1001. Has been back up for 2 mins, nothing to worry about. [16:51:55] thanks topranks :) [16:52:06] one day™ we will find why it happens :P [16:52:12] I guess forcing the mode didn't fix it :( [16:52:21] yeah... 
I was secretly hoping [16:52:33] the other thing being that it only happens with doh1001 [16:52:35] and no other host [16:52:36] it will be a story for the ages :) [16:53:38] topranks: I am tempted to try rebooting the host, you know, just because [16:53:42] I will do it I think [16:53:54] can't hurt at this stage [16:53:59] $ uptime 16:53:50 up 194 days, [16:54:00] so yeah [16:54:39] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on elastic2052.codfw.wmnet with reason: T320482 [16:54:43] T320482: Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 [16:54:46] !log depool and reboot doh1001 [16:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:03] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on elastic2052.codfw.wmnet with reason: T320482 [16:55:39] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:55:44] ^ expected [16:55:45] 10SRE, 10ops-codfw, 10Discovery-Search: Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 (10bking) Looks like there is [[ https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Sw_raid_rebuild_directions | a documented procedure for DC Ops to follow ]]. @Papaul I've downtimed the host... 
[16:55:54] 10SRE, 10ops-codfw, 10Discovery-Search: Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 (10bking) a:05RKemper→03None [16:57:16] (03PS1) 10Dduvall: P:gitlab::runner: Use WMF fork of buildkit for buildkitd service [puppet] - 10https://gerrit.wikimedia.org/r/841556 (https://phabricator.wikimedia.org/T319694) [16:58:02] (03CR) 10Ahmon Dancy: [C: 03+1] P:gitlab::runner: Use WMF fork of buildkit for buildkitd service [puppet] - 10https://gerrit.wikimedia.org/r/841556 (https://phabricator.wikimedia.org/T319694) (owner: 10Dduvall) [16:58:50] (03PS2) 10Dduvall: P:gitlab::runner: Use WMF fork of buildkit for buildkitd service [puppet] - 10https://gerrit.wikimedia.org/r/841556 (https://phabricator.wikimedia.org/T319694) [16:58:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:59:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T318955)', diff saved to https://phabricator.wikimedia.org/P35405 and previous config saved to /var/cache/conftool/dbconfig/20221011-165933-ladsgroup.json [16:59:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2123.codfw.wmnet with reason: Maintenance [16:59:38] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [16:59:41] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:59:47] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:59:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2123.codfw.wmnet with reason: Maintenance [16:59:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T318955)', diff saved to 
https://phabricator.wikimedia.org/P35406 and previous config saved to /var/cache/conftool/dbconfig/20221011-165955-ladsgroup.json [16:59:59] (03CR) 10Ahmon Dancy: [C: 03+1] P:gitlab::runner: Use WMF fork of buildkit for buildkitd service [puppet] - 10https://gerrit.wikimedia.org/r/841556 (https://phabricator.wikimedia.org/T319694) (owner: 10Dduvall) [17:00:09] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:00:18] ^ should be resolving soon [17:01:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T318955)', diff saved to https://phabricator.wikimedia.org/P35407 and previous config saved to /var/cache/conftool/dbconfig/20221011-170121-ladsgroup.json [17:01:55] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 311, down: 3, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:02:01] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:02:23] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:06:57] (03PS1) 10Ahmon Dancy: Restart buildkitd if its config files change [puppet] - 10https://gerrit.wikimedia.org/r/841557 (https://phabricator.wikimedia.org/T308271) [17:07:30] (03CR) 10CI reject: [V: 04-1] Restart buildkitd if its config files change [puppet] - 10https://gerrit.wikimedia.org/r/841557 (https://phabricator.wikimedia.org/T308271) (owner: 10Ahmon Dancy) [17:08:44] (03PS2) 10Ahmon Dancy: Restart buildkitd if its config files change [puppet] - 10https://gerrit.wikimedia.org/r/841557 (https://phabricator.wikimedia.org/T308271) [17:09:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:13:30] (03CR) 10Dduvall: [C: 03+1] Restart buildkitd if its config files change [puppet] - 
10https://gerrit.wikimedia.org/r/841557 (https://phabricator.wikimedia.org/T308271) (owner: 10Ahmon Dancy) [17:13:48] (03CR) 10Ahmon Dancy: "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1003/37502/gitlab-runner1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/841557 (https://phabricator.wikimedia.org/T308271) (owner: 10Ahmon Dancy) [17:16:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P35408 and previous config saved to /var/cache/conftool/dbconfig/20221011-171627-ladsgroup.json [17:24:29] (03CR) 10Btullis: Add a new production images for spark and spark-operator (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [17:25:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [17:25:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [17:25:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:26:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:26:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T318959)', diff saved to https://phabricator.wikimedia.org/P35409 and previous config saved to /var/cache/conftool/dbconfig/20221011-172608-ladsgroup.json [17:26:13] T318959: Add fr_user index on flaggedrevs in production - https://phabricator.wikimedia.org/T318959 [17:28:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T318959)', diff saved to https://phabricator.wikimedia.org/P35410 and 
previous config saved to /var/cache/conftool/dbconfig/20221011-172822-ladsgroup.json [17:31:02] (03PS3) 10Dduvall: Add type Wmflib::POSIX::Name [puppet] - 10https://gerrit.wikimedia.org/r/840215 (owner: 10Ahmon Dancy) [17:31:13] (03PS4) 10Dduvall: Add type Wmflib::POSIX::Name [puppet] - 10https://gerrit.wikimedia.org/r/840215 (owner: 10Ahmon Dancy) [17:31:20] (03CR) 10Dduvall: Add type Wmflib::POSIX::Name (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/840215 (owner: 10Ahmon Dancy) [17:31:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P35411 and previous config saved to /var/cache/conftool/dbconfig/20221011-173134-ladsgroup.json [17:31:49] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: OpenConfirm - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:32:39] dancy sorry for the delay (errand) but see rz.l sorted things for you now [17:32:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [17:32:58] Yep I'm all set. 
[17:33:08] cool :) [17:33:31] jbond: I do have https://gerrit.wikimedia.org/r/c/operations/puppet/+/841557 which I made a few minutes ago [17:33:46] (03CR) 10Ssingh: [C: 03+2] sites.yaml: add dns4004 to anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/841533 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [17:34:21] jbond: And we need https://gerrit.wikimedia.org/r/c/operations/puppet/+/841556 deployed too [17:34:38] * jbond looking [17:35:45] !log running homer "cr*-ulsfo*" commit "Gerrit 841533: sites.yaml: add dns4004 to anycast_neighbors" [17:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:04] (03PS3) 10Jbond: Restart buildkitd if its config files change [puppet] - 10https://gerrit.wikimedia.org/r/841557 (https://phabricator.wikimedia.org/T308271) (owner: 10Ahmon Dancy) [17:36:11] (03CR) 10Jbond: Restart buildkitd if its config files change (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/841557 (https://phabricator.wikimedia.org/T308271) (owner: 10Ahmon Dancy) [17:37:46] !log completed homer run for "cr*-ulsfo*" commit 841533 [17:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [17:37:55] (03CR) 10Jbond: [C: 03+2] Restart buildkitd if its config files change [puppet] - 10https://gerrit.wikimedia.org/r/841557 (https://phabricator.wikimedia.org/T308271) (owner: 10Ahmon Dancy) [17:38:10] (03CR) 10Jbond: [C: 03+2] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/841556 (https://phabricator.wikimedia.org/T319694) (owner: 10Dduvall) [17:38:31] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] 
https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [17:40:12] dancy: both merged and deployed [17:40:31] Thanks! I'm checking w/ Dan to see if anything needs a manual restart as a result. [17:40:39] ack [17:41:41] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ssingh) dns4004 has been commissioned. [17:41:48] Looks like restarts are happening automatically. Thanks for the help jbond and rzl. [17:42:09] 👍 [17:42:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [17:43:02] cool [17:45:15] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [17:46:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T318955)', diff saved to https://phabricator.wikimedia.org/P35412 and previous config saved to /var/cache/conftool/dbconfig/20221011-174641-ladsgroup.json [17:46:47] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [17:47:39] (03CR) 10Jbond: [C: 03+2] "lgtm will merge thanks" [puppet] - 10https://gerrit.wikimedia.org/r/840215 (owner: 10Ahmon Dancy) [17:48:10] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag 
[17:57:10] (03PS1) 10Jbond: systemd: Add explicit default for override_filename [puppet] - 10https://gerrit.wikimedia.org/r/841570 [17:58:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P35413 and previous config saved to /var/cache/conftool/dbconfig/20221011-175842-ladsgroup.json [17:58:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37503/console" [puppet] - 10https://gerrit.wikimedia.org/r/841570 (owner: 10Jbond) [17:59:20] (03CR) 10CI reject: [V: 04-1] systemd: Add explicit default for override_filename [puppet] - 10https://gerrit.wikimedia.org/r/841570 (owner: 10Jbond) [18:00:04] dduvall and ^demon: How many deployers does it take to do MediaWiki train - Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221011T1800). [18:00:49] !log sudo gnt-node remove ganeti4001.ulsfo.wmnet [18:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:33] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti4001.ulsfo.wmnet [18:01:58] (03PS1) 10TrainBranchBot: group0 wikis to 1.40.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841571 (https://phabricator.wikimedia.org/T314194) [18:02:00] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.40.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841571 (https://phabricator.wikimedia.org/T314194) (owner: 10TrainBranchBot) [18:02:19] (03PS2) 10Jbond: systemd: Add explicit default for override_filename [puppet] - 10https://gerrit.wikimedia.org/r/841570 [18:02:49] (03Merged) 10jenkins-bot: group0 wikis to 1.40.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841571 (https://phabricator.wikimedia.org/T314194) (owner: 10TrainBranchBot) [18:03:10] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - 
https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [18:03:11] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [18:07:06] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [18:07:07] !log demon@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.5 refs T314194 [18:07:13] T314194: 1.40.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T314194 [18:08:26] (03CR) 10Jbond: [C: 03+2] systemd: Add explicit default for override_filename [puppet] - 10https://gerrit.wikimedia.org/r/841570 (owner: 10Jbond) [18:09:27] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:09:28] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti4001.ulsfo.wmnet [18:09:34] ssh-keygen -f "/home/jbond/.ssh/known_hosts.d/wmf-prod" -R "sretest1002.eqiad.wmnet" [18:09:35] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: ulsfo refresh scheduling - https://phabricator.wikimedia.org/T317249 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `ganeti4001.ulsfo.wmnet` - ganeti4001.ulsfo.wmnet (**PASS**) - Downtimed host... 
[18:10:17] (03CR) 10Ssingh: [C: 03+2] hiera: decom ganeti4001 [puppet] - 10https://gerrit.wikimedia.org/r/841553 (https://phabricator.wikimedia.org/T317249) (owner: 10Ssingh) [18:11:26] !log re-enable cr1-eqiad<->asw2-d-eqiad link for re-cabling - T313463 [18:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:31] T313463: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 [18:11:49] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: ulsfo refresh scheduling - https://phabricator.wikimedia.org/T317249 (10ssingh) @RobH: ganeti4001 has been decommissioned. Thanks! [18:13:39] RECOVERY - VRRP status on cr1-eqiad is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [18:13:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T318959)', diff saved to https://phabricator.wikimedia.org/P35414 and previous config saved to /var/cache/conftool/dbconfig/20221011-181348-ladsgroup.json [18:13:53] T318959: Add fr_user index on flaggedrevs in production - https://phabricator.wikimedia.org/T318959 [18:14:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1013:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [18:16:40] !log restarting blazegraph on wdqs1013 (BlazegraphFreeAllocatorsDecreasingRapidly) [18:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:51] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:19:01] RECOVERY - WDQS SPARQL on 
wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:19:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1013:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [18:29:02] https://github.com/wikimedia/puppet/blob/production/modules/systemd/manifests/unit.pp#L69-L73~.~.~. [18:29:40] (03PS1) 10Ori: service::docker: allow runtime to be specified [puppet] - 10https://gerrit.wikimedia.org/r/841574 (https://phabricator.wikimedia.org/T316706) [18:29:42] (03PS1) 10Ori: add profile::docker::gvisor [puppet] - 10https://gerrit.wikimedia.org/r/841575 (https://phabricator.wikimedia.org/T316706) [18:32:05] (03CR) 10CI reject: [V: 04-1] add profile::docker::gvisor [puppet] - 10https://gerrit.wikimedia.org/r/841575 (https://phabricator.wikimedia.org/T316706) (owner: 10Ori) [18:37:07] (03PS2) 10Ori: add profile::docker::gvisor [puppet] - 10https://gerrit.wikimedia.org/r/841575 (https://phabricator.wikimedia.org/T316706) [18:41:34] (03PS1) 10Jbond: systemd: improve abbility to have addtional overrides [puppet] - 10https://gerrit.wikimedia.org/r/841577 [18:53:02] (03PS1) 10Dduvall: P:gitlab::runner: Enforce Wmflib::POSIX::Variables type for environment [puppet] - 10https://gerrit.wikimedia.org/r/841578 [18:58:05] (03PS2) 10Jbond: systemd: improve abbility to have addtional overrides [puppet] - 10https://gerrit.wikimedia.org/r/841577 [19:00:18] (03CR) 10CI reject: [V: 04-1] systemd: improve abbility to have addtional overrides [puppet] - 10https://gerrit.wikimedia.org/r/841577 (owner: 10Jbond) [19:11:04] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): 
Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (10cmooney) [19:11:26] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10cmooney) [19:11:32] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (10cmooney) 05Open→03In progress p:05Triage→03High [19:11:48] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (10cmooney) [19:14:24] (03CR) 10EllenR: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841551 (https://phabricator.wikimedia.org/T320495) (owner: 10Eigyan) [19:15:32] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (10cmooney) [19:16:32] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (10ayounsi) a:05Jclark-ctr→03None [19:20:47] (03PS1) 10Andrew Bogott: keystone: remove password safelist check from wmtotp auth module [puppet] - 10https://gerrit.wikimedia.org/r/841581 (https://phabricator.wikimedia.org/T320541) [19:21:26] (03CR) 10CI reject: [V: 04-1] keystone: remove password safelist check from wmtotp auth module [puppet] - 10https://gerrit.wikimedia.org/r/841581 (https://phabricator.wikimedia.org/T320541) (owner: 10Andrew Bogott) [19:23:23] (03PS2) 10Andrew Bogott: keystone: remove password safelist check from wmtotp auth module [puppet] - 
10https://gerrit.wikimedia.org/r/841581 (https://phabricator.wikimedia.org/T320541) [19:29:10] 10SRE, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (10cmooney) [19:32:33] (03PS3) 10Jbond: systemd: improve abbility to have addtional overrides [puppet] - 10https://gerrit.wikimedia.org/r/841577 [19:35:31] (03CR) 10Andrew Bogott: "this is useful if a user wants to script generation of application credentials or similar. i'm not sure it's strictly necessary, we could " [puppet] - 10https://gerrit.wikimedia.org/r/841581 (https://phabricator.wikimedia.org/T320541) (owner: 10Andrew Bogott) [19:37:11] (03PS1) 10Ryan Kemper: [wip] query_service: try installing nginx w extras [puppet] - 10https://gerrit.wikimedia.org/r/841582 (https://phabricator.wikimedia.org/T313751) [19:38:06] (03PS2) 10Ryan Kemper: [wip] query_service: try installing nginx w extras [puppet] - 10https://gerrit.wikimedia.org/r/841582 (https://phabricator.wikimedia.org/T313751) [19:39:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with reboot policy FORCED [19:40:58] (03CR) 10CI reject: [V: 04-1] [wip] query_service: try installing nginx w extras [puppet] - 10https://gerrit.wikimedia.org/r/841582 (https://phabricator.wikimedia.org/T313751) (owner: 10Ryan Kemper) [19:41:15] (03Abandoned) 10Andrew Bogott: keystone: remove password safelist check from wmtotp auth module [puppet] - 10https://gerrit.wikimedia.org/r/841581 (https://phabricator.wikimedia.org/T320541) (owner: 10Andrew Bogott) [19:43:20] (03PS3) 10Ryan Kemper: [wip] query_service: try installing nginx w extras [puppet] - 10https://gerrit.wikimedia.org/r/841582 (https://phabricator.wikimedia.org/T313751) [19:44:17] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37505/console" [puppet] 
- 10https://gerrit.wikimedia.org/r/841582 (https://phabricator.wikimedia.org/T313751) (owner: 10Ryan Kemper) [19:45:20] (03PS4) 10Ryan Kemper: [wip] query_service: try installing nginx w extras [puppet] - 10https://gerrit.wikimedia.org/r/841582 (https://phabricator.wikimedia.org/T313751) [19:46:13] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37506/console" [puppet] - 10https://gerrit.wikimedia.org/r/841582 (https://phabricator.wikimedia.org/T313751) (owner: 10Ryan Kemper) [19:46:51] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with reboot policy FORCED [19:47:19] (03Abandoned) 10Dduvall: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/835667 (owner: 10PipelineBot) [19:49:30] (03PS5) 10Ryan Kemper: wdqs-test: try installing nginx w extras [puppet] - 10https://gerrit.wikimedia.org/r/841582 (https://phabricator.wikimedia.org/T313751) [19:49:43] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] wdqs-test: try installing nginx w extras [puppet] - 10https://gerrit.wikimedia.org/r/841582 (https://phabricator.wikimedia.org/T313751) (owner: 10Ryan Kemper) [19:51:00] 10SRE: rsyslog::conf puppet define types inserts an extraneous newline in the content param - https://phabricator.wikimedia.org/T320569 (10jhathaway) [19:51:09] 10SRE: rsyslog::conf puppet define types inserts an extraneous newline in the content param - https://phabricator.wikimedia.org/T320569 (10jhathaway) a:03jhathaway [19:54:38] (03PS1) 10Ryan Kemper: Revert "wdqs-test: try installing nginx w extras" [puppet] - 10https://gerrit.wikimedia.org/r/841518 [19:55:32] (03PS1) 10JHathaway: rsyslog::conf remove trailing newline logic [puppet] - 10https://gerrit.wikimedia.org/r/841583 (https://phabricator.wikimedia.org/T320569) [19:56:41] (03CR) 10Dduvall: P:ci::docker: Install upstream docker packages for all CI agents (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/834399 (https://phabricator.wikimedia.org/T318382) (owner: 10Dduvall) [19:57:07] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/841583 (https://phabricator.wikimedia.org/T320569) (owner: 10JHathaway) [19:57:13] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/841583 (https://phabricator.wikimedia.org/T320569) (owner: 10JHathaway) [19:57:53] (03CR) 10Dduvall: P:ci::docker: Upgrade docker to 20.10.18 on all CI agents (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/834400 (https://phabricator.wikimedia.org/T318382) (owner: 10Dduvall) [19:58:10] (03PS2) 10Ryan Kemper: Revert "wdqs-test: try installing nginx w extras" [puppet] - 10https://gerrit.wikimedia.org/r/841518 (https://phabricator.wikimedia.org/T313751) [19:58:23] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] "sd" [puppet] - 10https://gerrit.wikimedia.org/r/841518 (https://phabricator.wikimedia.org/T313751) (owner: 10Ryan Kemper) [20:00:05] RoanKattouw, Urbanecm, cjming, and TheresNoTime: Dear deployers, time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221011T2000). [20:00:05] eigyan: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:27] I can deploy! :D [20:00:33] Go ahead! [20:00:53] * TheresNoTime waits on eigyan :) [20:01:24] (03PS2) 10Samtar: Undeploy the GDI wave 3 survey from PROD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841551 (https://phabricator.wikimedia.org/T320495) (owner: 10Eigyan) [20:02:28] greetings all [20:02:45] o/ [20:03:08] eigyan: hi! 
:) [20:03:28] hey there TheresNoTime :) [20:03:41] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841551 (https://phabricator.wikimedia.org/T320495) (owner: 10Eigyan) [20:04:39] (03Merged) 10jenkins-bot: Undeploy the GDI wave 3 survey from PROD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841551 (https://phabricator.wikimedia.org/T320495) (owner: 10Eigyan) [20:05:07] !log samtar@deploy1002 Started scap: Backport for [[gerrit:841551|Undeploy the GDI wave 3 survey from PROD (T320495)]] [20:05:11] T320495: Undeploy GDI Safety Survey Wave 3 from EN, ES, FR, and PT wikis - https://phabricator.wikimedia.org/T320495 [20:05:31] !log samtar@deploy1002 samtar and essexigyan: Backport for [[gerrit:841551|Undeploy the GDI wave 3 survey from PROD (T320495)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:05:57] eigyan: that's live on mwdebug1001, can you test? :) [20:06:19] will do! thank you! [20:06:50] 10SRE, 10ops-codfw, 10Discovery-Search: Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 (10Papaul) @bking this host is out of warranty. If it is a critical host you will have to let us know and request to purchase a disk. Another option is to check also if we have any disk similar... 
[20:07:19] All is well TheresNoTime [20:07:27] great, syncing [20:09:34] (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [20:10:11] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:11:37] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:841551|Undeploy the GDI wave 3 survey from PROD (T320495)]] (duration: 06m 29s) [20:11:41] T320495: Undeploy GDI Safety Survey Wave 3 from EN, ES, FR, and PT wikis - https://phabricator.wikimedia.org/T320495 [20:11:53] that's live in production now eigyan, mind checking one last time? [20:12:05] will do! [20:12:09] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (10Andrew) > > Also you might like to update this section when convenient: https://wikitech.wikimedia.org/wiki/Dumps/Dump_ser... 
[20:13:49] We are looking good @there [20:14:04] We are looking good TheresNoTime [20:14:10] eigyan: great, all done then :) [20:14:30] Excellent, as always thanks for all your help:) [20:15:00] you're very welcome :) [20:16:15] (03PS2) 10JHathaway: rsyslog::conf remove trailing newline logic [puppet] - 10https://gerrit.wikimedia.org/r/841583 (https://phabricator.wikimedia.org/T320569) [20:18:23] I'll be around for a little while longer if there's any last-minute patches for deployment [20:23:01] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [20:25:11] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [20:25:26] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=phab1001-vcs.eqiad.wmnet [20:25:55] !log depooling git-ssh service backends - checking if monitoring will alert [20:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:39] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=phab2001-vcs.codfw.wmnet [20:26:52] !log close UTC late backport window [20:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:23] !log depooling git-ssh service backends with confctl - T296022 [20:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:27] T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 [20:30:25] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:32:13] PROBLEM - PyBal IPVS diff check on lvs2008 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.153.250:22, 2620:0:860:ed1a::3:fa:22]) 
https://wikitech.wikimedia.org/wiki/PyBal [20:32:21] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.153.250:22, 2620:0:860:ed1a::3:fa:22]) https://wikitech.wikimedia.org/wiki/PyBal [20:32:32] ^ yea, that's what I wanted to test once again. the docs claim these are "temporary" [20:33:02] and that they would happen when adding new services. but that's not the case here. it's about properly depooling if you only have 1 backend [20:33:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [20:33:25] I am not sure it's possible to do it right [20:34:43] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:861:ed1a::3:16:22, 208.80.154.250:22]) https://wikitech.wikimedia.org/wiki/PyBal [20:35:09] PROBLEM - PyBal IPVS diff check on lvs1018 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:861:ed1a::3:16:22, 208.80.154.250:22]) https://wikitech.wikimedia.org/wiki/PyBal [20:35:18] !log dzahn@cumin2002 conftool action : set/pooled=inactive; selector: dc=codfw,name=phab2001-vcs.codfw.wmnet [20:35:24] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=phab1001-vcs.eqiad.wmnet [20:38:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [20:39:40] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1018 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: 
set([2620:0:861:ed1a::3:16:22, 208.80.154.250:22]) daniel_zahn trying to decom this service https://wikitech.wikimedia.org/wiki/PyBal [20:39:40] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:861:ed1a::3:16:22, 208.80.154.250:22]) daniel_zahn trying to decom this service https://wikitech.wikimedia.org/wiki/PyBal [20:39:40] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2008 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.153.250:22, 2620:0:860:ed1a::3:fa:22]) daniel_zahn trying to decom this service https://wikitech.wikimedia.org/wiki/PyBal [20:39:40] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.153.250:22, 2620:0:860:ed1a::3:fa:22]) daniel_zahn trying to decom this service https://wikitech.wikimedia.org/wiki/PyBal [20:39:41] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/841583 (https://phabricator.wikimedia.org/T320569) (owner: 10JHathaway) [20:40:21] PROBLEM - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/git-ssh is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [20:40:57] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [20:41:20] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/git-ssh is broken daniel_zahn ? https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [20:41:20] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/git-ssh is broken daniel_zahn ? 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [20:41:20] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/git-ssh is broken daniel_zahn ? https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [20:41:20] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/git-ssh is broken daniel_zahn ? https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [20:41:25] RECOVERY - PyBal IPVS diff check on lvs1018 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [20:52:14] mutante: hey [20:52:28] need help? [20:54:09] bblack: I am looking for a way to disable/remove an existing LVS service, but in a way that is still easy to revert and does not cause these alerts [20:54:09] basically, after all the config is deployed, you have to manually remove the final entry from IPVS itself from the CLI [20:54:31] it seems I can only do it the right way following https://wikitech.wikimedia.org/wiki/LVS#Remove_a_load_balanced_service [20:54:36] if you end up later reverting the patches, the redeployment will re-provision fine. but final decom is manual-only. [20:54:37] which starts with silencing those alerts [20:54:44] there are about 8 alerts though [20:54:49] not just the networking ones [20:55:06] 4 of them are "Compilation of file '/srv/config-master/pybal/codfw/git-ssh' is broken" [20:55:33] even though I have not done anything besides depool. 
but the special case is there is just one backend [20:55:55] (03PS1) 10Dduvall: P:gitlab::runner: Fix buildkitd image ref on WMCS [puppet] - 10https://gerrit.wikimedia.org/r/841584 (https://phabricator.wikimedia.org/T319694) [20:56:23] The "remove a loadbalanced service" thing also seems to kind of assume a "discovery" service in places [20:56:28] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=codfw,name=phab2001-vcs.codfw.wmnet [20:56:41] yeah I don't think a service can legitimately exist without a backend [20:57:06] so.. first I just wanted to remove it from DNS. thinking that is still easy to revert if you have to [20:57:09] RECOVERY - PyBal IPVS diff check on lvs2008 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [20:57:15] but of course pybal will not like that either [20:57:17] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [20:57:23] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:57:35] then I wanted to test again what alerts I really get [20:57:47] when I depool the one backend [20:57:48] are we removing this for good? [20:57:58] I hope so. yes. [20:57:59] git-ssh I mean [20:58:03] that's the goal [20:58:10] I didn't realize [20:58:14] I was just hoping I could just disable it for a week [20:58:21] before there are more patches [20:59:05] was there some planned phaseout I missed? 
[20:59:37] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:59:43] oh this is maybe not what I was thinking it was [20:59:49] I get it now, this is *just* for phab-vcs [20:59:51] it's https://phabricator.wikimedia.org/T296022 [21:00:02] yea, we want to keep gerrit and gitlab [21:00:05] for some reason I started thinking this was our gerrit ssh port somehow, indirectly :) [21:00:08] but disable repos on phab [21:00:34] no, it's just trying to reduce the number of places we have for git repos [21:00:40] and at the same time simplify the phab server setup [21:00:43] if you just want to disable it, you could leave all this lvs/dns stuff alone and just change the ferm rules on the phab hosts to not allow port 22 from anywhere? [21:00:53] although that probably still causes a monitoring alert somewhere to silence [21:01:40] hmm. ACK. right now I was bothered by the additional "Compilation of file '/srv/config-master/pybal/eqiad/git-ssh' is broken" type of alerts [21:01:45] yeah [21:01:46] but looks like 2 of them did go away [21:01:51] I think that's because the service has no backend [21:01:56] I kind of remember them from the past too [21:02:15] it just takes quite some time until that check realizes changes [21:02:34] right now I have pooled the backends in both DCs again [21:02:38] ok [21:03:01] there were 4 alerts (on puppetmasters) about the templates. now there are 2 [21:03:06] but still one per DC [21:05:34] which alerts? 
[21:06:09] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=puppetmaster1001&service=Confd+template+for+%2Fsrv%2Fconfig-master%2Fpybal%2Feqiad%2Fgit-ssh [21:06:16] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=puppetmaster2001&service=Confd+template+for+%2Fsrv%2Fconfig-master%2Fpybal%2Feqiad%2Fgit-ssh [21:06:31] maybe in a minute [21:06:42] that might be one of those ones that persists due to some error-state file [21:06:45] hmmm [21:07:09] I think I had to delete the error files before [21:07:16] and it happened every time I tried this :) [21:07:42] then I did it again :p [21:08:02] let me try to find the err file [21:08:37] 10SRE-swift-storage, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-Phonos, 10Community-Tech (CommTech-Sprint-35), and 2 others: Phonos links to an unauthorized URL - https://phabricator.wikimedia.org/T317417 (10GMikesell-WMF) @TheresNoTime View a page on the Beta Cluster with a Phonos parser functi... [21:09:08] yeah /var/run/confd-template [21:09:41] basically: rm -f /var/run/confd-template/.git-ssh* [21:10:32] ACK, thanks. those are the ones. [21:10:54] !log puppetmaster2001: rm .*.err in /var/run/confd-template [21:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:00] just the .err files but same thing [21:11:27] RECOVERY - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [21:12:01] yea, so, if I really want to disable it for a week, downtiming the checks for _any_ LVS service seems a bit bad [21:12:52] !log puppetmaster1001: rm .*.err in /var/run/confd-template [21:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:12] I will look at your suggestion to close port 22 [21:13:23] and maybe test what alerts then? 
[21:13:55] PROBLEM - SSH on wdqs2005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:14:42] probably a lot of functional checks will fail, including pybal healthchecks [21:14:56] but they should be ackable individually? [21:15:06] basically any checks that actually hit that port [21:16:15] ok, ack. if all those alerts are specific to my service then that's better [21:17:08] or.. I need to remove them from pybal config? [21:17:14] and then depool [21:18:09] I also wasn't sure if it's a bad idea to remove it from conftool-data if I only touch one of both data centers [21:18:48] it should also be ok if I just remove it from conftool-data for both DCs and revert if needed. hopefully it won't be needed [21:21:20] I will stop the sshd service on a backend to see the alert [21:21:38] (03CR) 10Ahmon Dancy: [C: 03+1] P:gitlab::runner: Fix buildkitd image ref on WMCS [puppet] - 10https://gerrit.wikimedia.org/r/841584 (https://phabricator.wikimedia.org/T319694) (owner: 10Dduvall) [21:22:22] !log phab2001 - systemctl stop ssh-phab; temp disable puppet [21:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:41] (03CR) 10Dzahn: [C: 03+2] P:gitlab::runner: Fix buildkitd image ref on WMCS [puppet] - 10https://gerrit.wikimedia.org/r/841584 (https://phabricator.wikimedia.org/T319694) (owner: 10Dduvall) [21:23:18] Thx mutante! [21:23:40] yes thank you :) [21:23:53] np, cloud-only, heh [21:25:55] buildkitd is running on runner-1024.gitlab-runners.eqiad1.wikimedia.cloud now (after I ran run-puppet-agent) [21:27:09] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:27:11] nice! 
[21:27:21] PROBLEM - PyBal backends health check on lvs2008 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[21:27:53] bblack: ^ this is the one when I just stop the backend or would firewall it. but yea, I can downtime those.. right
[21:28:25] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[21:28:31] waits a couple more minutes for more.. and there it goes
[21:28:50] this will be on every lvs server.. but just 2 per DC I guess
[21:29:38] (Abandoned) Dduvall: pipeline: Make blubberfile definitions slightly more coherent [mediawiki-config] - https://gerrit.wikimedia.org/r/708582 (owner: Dduvall)
[21:30:00] ACKNOWLEDGEMENT - PyBal backends health check on lvs2008 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled daniel_zahn soft decom https://wikitech.wikimedia.org/wiki/PyBal
[21:30:00] ACKNOWLEDGEMENT - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled daniel_zahn soft decom https://wikitech.wikimedia.org/wiki/PyBal
[21:30:03] yup!
[21:30:26] not using cookbook but good old Icinga web UI to downtime just those and not other stuff on the hosts
[21:30:36] ack,ty
[21:31:01] PROBLEM - SSH on mw1319.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:36:53] !log phab1001 / phab2001 - temp. disabled puppet; stopped ssh-phab service; scheduled icinga downtimes for ssh-phab pybal backend alerts - effectively "soft shutting down" the service - T296022
[21:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:36:58] T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022
[21:41:15] ACKNOWLEDGEMENT - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled: git-ssh6_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled daniel_zahn soft decom https://wikitech.wikimedia.org/wiki/PyBal
[21:41:15] ACKNOWLEDGEMENT - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled: git-ssh6_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled daniel_zahn soft decom https://wikitech.wikimedia.org/wiki/PyBal
[21:41:52] (PS1) Dzahn: phabricator: stop ssh-phab service [puppet] - https://gerrit.wikimedia.org/r/841587 (https://phabricator.wikimedia.org/T296022)
[21:42:32] (CR) Dzahn: [C: +2] phabricator: stop ssh-phab service [puppet] - https://gerrit.wikimedia.org/r/841587 (https://phabricator.wikimedia.org/T296022) (owner: Dzahn)
[21:44:49] (CR) Dzahn: [C: +2] "puppet re-enabled" [puppet] - https://gerrit.wikimedia.org/r/841587 (https://phabricator.wikimedia.org/T296022) (owner: Dzahn)
[21:47:48] (CR) Ori: "Thanks for this."
[puppet] - https://gerrit.wikimedia.org/r/841577 (owner: Jbond)
[21:58:05] (PS4) Jbond: systemd::override: Add new helper define for overrides [puppet] - https://gerrit.wikimedia.org/r/841577
[21:59:14] (PS5) Jbond: systemd::override: Add new helper define for overrides [puppet] - https://gerrit.wikimedia.org/r/841577
[22:00:22] (CR) Jbond: systemd::override: Add new helper define for overrides (2 comments) [puppet] - https://gerrit.wikimedia.org/r/841577 (owner: Jbond)
[22:02:20] (CR) CI reject: [V: -1] systemd::override: Add new helper define for overrides [puppet] - https://gerrit.wikimedia.org/r/841577 (owner: Jbond)
[22:03:16] (PS6) Jbond: systemd::override: Add new helper define for overrides [puppet] - https://gerrit.wikimedia.org/r/841577
[22:04:44] (PS7) Jbond: systemd::override: Add new helper define for overrides [puppet] - https://gerrit.wikimedia.org/r/841577
[22:06:22] (PS8) Jbond: systemd::override: Add new helper define for overrides [puppet] - https://gerrit.wikimedia.org/r/841577
[22:10:32] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37507/console" [puppet] - https://gerrit.wikimedia.org/r/841577 (owner: Jbond)
[22:10:43] (CR) Jbond: "Its late but i think this should be ready to review, i realised so probably wont get to merge until Thursday but it should be a noop for c" [puppet] - https://gerrit.wikimedia.org/r/841577 (owner: Jbond)
[22:11:52] (CR) Jbond: systemd::override: Add new helper define for overrides (1 comment) [puppet] - https://gerrit.wikimedia.org/r/841577 (owner: Jbond)
[22:13:34] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37508/console" [puppet] - https://gerrit.wikimedia.org/r/841577 (owner: Jbond)
[22:15:01] RECOVERY - SSH on wdqs2005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0)
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:21:17] (PS9) Jbond: systemd::override: Add new helper define for overrides [puppet] - https://gerrit.wikimedia.org/r/841577
[22:32:09] RECOVERY - SSH on mw1319.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:00:09] SRE, GitLab, Infrastructure-Foundations, CAS-SSO: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (bd808)
[23:22:15] (CR) Cwhite: [C: +1] "LGTM! PCC noop: https://puppet-compiler.wmflabs.org/pcc-worker1003/37509/" [puppet] - https://gerrit.wikimedia.org/r/838833 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)