[00:00:45] <wikibugs>	 (PS21) Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040)
[00:01:20] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:02:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:04:19] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[00:09:19] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[00:11:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[00:16:19] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: drop_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[00:21:02] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s6 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1251.58 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:21:38] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s7 on db2100 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1288.17 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:33:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[00:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[00:42:52] <icinga-wm>	 RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:17:46] <icinga-wm>	 PROBLEM - Check systemd state on dbprov2001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:22:04] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s6 on db2141 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:22:42] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s7 on db2100 is OK: OK slave_sql_lag Replication lag: 0.22 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:33:14] <icinga-wm>	 PROBLEM - Check systemd state on dbprov2002 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:47:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:52:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:00:04] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221011T0200)
[02:03:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:04:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:04:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:05:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:07:42] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.40.0-wmf.5 [core] (wmf/1.40.0-wmf.5) - https://gerrit.wikimedia.org/r/840575 (https://phabricator.wikimedia.org/T314194)
[02:07:44] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.40.0-wmf.5 [core] (wmf/1.40.0-wmf.5) - https://gerrit.wikimedia.org/r/840575 (https://phabricator.wikimedia.org/T314194) (owner: TrainBranchBot)
[02:07:45] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:12:04] <icinga-wm>	 RECOVERY - Check systemd state on dbprov2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:20:02] <wikibugs>	 (CR) CI reject: [V: -1] Branch commit for wmf/1.40.0-wmf.5 [core] (wmf/1.40.0-wmf.5) - https://gerrit.wikimedia.org/r/840575 (https://phabricator.wikimedia.org/T314194) (owner: TrainBranchBot)
[02:27:30] <icinga-wm>	 RECOVERY - Check systemd state on dbprov2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:18] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:31:38] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:37:58] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.264 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:38:18] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48827 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:00:04] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221011T0300)
[03:06:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[03:07:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[03:07:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[03:09:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[03:12:38] <icinga-wm>	 PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: generate_otrs_aliases.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:33:06] <icinga-wm>	 RECOVERY - dump of matomo in eqiad on backupmon1001 is OK: Last dump for matomo at eqiad (db1108) taken on 2022-10-11 03:21:25 (1.2 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[03:51:06] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:08:58] <icinga-wm>	 RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:09:34] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[04:33:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[04:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[04:52:16] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: Dear deployers, time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221011T0600).
[06:10:33] <wikibugs>	 (PS2) KartikMistry: ContentTranslation: Make Mongolian Wikipedia MT stricter by 10% [mediawiki-config] - https://gerrit.wikimedia.org/r/839411 (https://phabricator.wikimedia.org/T319156)
[06:34:14] <elukey>	 !log kill leftover process of bmansurov on an-airflow1002 to allow user cleanup via puppet
[06:34:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:58] <XioNoX>	 !log delete now unused VC ports on asw2-c4-eqiad - T313384
[06:36:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:02] <stashbot>	 T313384: eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384
[06:37:42] <elukey>	 !log kill leftover process of bmansurov on stat1007 to allow user cleanup via puppet
[06:37:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:52] <icinga-wm>	 RECOVERY - Juniper virtual chassis ports on asw2-c-eqiad is OK: OK: UP: 20 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[06:43:15] <elukey>	 !log kill leftover process of nokafor on stat1004 to allow user cleanup via puppet
[06:43:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:51] <elukey>	 !log kill leftover process of jmads on stat1005 to allow user cleanup via puppet
[06:44:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:01] <wikibugs>	 (CR) Santhosh: [C: +2] ContentTranslation: Make Mongolian Wikipedia MT stricter by 10% [mediawiki-config] - https://gerrit.wikimedia.org/r/839411 (https://phabricator.wikimedia.org/T319156) (owner: KartikMistry)
[06:46:07] <wikibugs>	 (Merged) jenkins-bot: ContentTranslation: Make Mongolian Wikipedia MT stricter by 10% [mediawiki-config] - https://gerrit.wikimedia.org/r/839411 (https://phabricator.wikimedia.org/T319156) (owner: KartikMistry)
[06:52:04] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, Patch-For-Review, Sustainability (Incident Followup): Rate limiting for hotlinked images - https://phabricator.wikimedia.org/T317799 (ayounsi) [clinic duty] tagging the teams I think are relevant to this task, please change the tags as needed
[06:52:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[06:52:35] <kart_>	 I'll deploy 839411, as it was +2'ed by mistake. Few minutes to go for Backport deployment window..
[06:53:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[06:53:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[06:53:51] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, Patch-For-Review, and 2 others: rsync puppet module doesn't delete removed config - https://phabricator.wikimedia.org/T205618 (ayounsi) [clinic duty] tagging the teams I think are relevant to this task, please change the tags as needed
[06:54:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[06:54:40] <wikibugs>	 ops-eqiad, Data-Engineering: Check analytics1086's mgmt's cable - https://phabricator.wikimedia.org/T320458 (elukey)
[06:54:53] <wikibugs>	 ops-eqiad, Data-Engineering: Check analytics1086 mgmt's cable - https://phabricator.wikimedia.org/T320458 (elukey)
[06:57:48] <wikibugs>	 (PS1) Muehlenhoff: Remove LDAP access for aassaf [puppet] - https://gerrit.wikimedia.org/r/841389
[07:00:03] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove LDAP access for aassaf [puppet] - https://gerrit.wikimedia.org/r/841389 (owner: Muehlenhoff)
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221011T0700).
[07:00:05] <jouncebot>	 kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:26] * kart_ is here
[07:00:33] <kart_>	 will self deploy. Minor change.
[07:00:47] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/839411 (https://phabricator.wikimedia.org/T319156) (owner: KartikMistry)
[07:01:02] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:839411|ContentTranslation: Make Mongolian Wikipedia MT stricter by 10% (T319156)]]
[07:01:06] <stashbot>	 T319156: Make Mongolian Wikipedia Machine Translation stricter by 10% - https://phabricator.wikimedia.org/T319156
[07:01:59] <logmsgbot>	 !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:839411|ContentTranslation: Make Mongolian Wikipedia MT stricter by 10% (T319156)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[07:02:26] <wikibugs>	 (PS1) Muehlenhoff: Make ganeti4008 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/841390 (https://phabricator.wikimedia.org/T317247)
[07:09:58] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:839411|ContentTranslation: Make Mongolian Wikipedia MT stricter by 10% (T319156)]] (duration: 08m 56s)
[07:10:03] <stashbot>	 T319156: Make Mongolian Wikipedia Machine Translation stricter by 10% - https://phabricator.wikimedia.org/T319156
[07:10:54] <kart_>	 I'm done. 
[07:11:07] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[07:12:21] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[07:13:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:15:48] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[07:16:32] <ryankemper>	 !log [Elastic] Updated cross-cluster remote seeds (masters): `ryankemper@mwmaint1002:~/elastic$ python push_cross_cluster_conf.py https://search.svc.eqiad.wmnet:9[2,4,6]43/_cluster/settings --ccc chi=chi_eqiad_masters.lst psi=psi_eqiad_masters.lst omega=omega_eqiad_masters.lst`
[07:16:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:05] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[07:17:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet
[07:17:53] <ryankemper>	 !log [Elastic] Forcing recheck of elastic settings check alerts; expecting a bit of noise as the alerts resolve (hopefully)
[07:17:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:18:00] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[07:18:18] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[07:18:41] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9200 on elastic1054 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:18:43] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9600 on elastic1075 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:18:43] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9600 on elastic1073 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:18:43] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9200 on elastic1074 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:18:43] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9200 on elastic1081 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:18:45] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9600 on elastic1083 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:18:45] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9200 on elastic1094 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:18:45] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9600 on elastic1095 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:18:47] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9200 on elastic1100 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:18:47] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9600 on elastic1102 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:18:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[07:21:05] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[07:21:35] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[07:22:26] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[07:22:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:24:15] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[07:30:48] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[07:31:48] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[07:32:59] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:34:55] <jinxer-wm>	 (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[07:39:55] <jinxer-wm>	 (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[07:40:52] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[07:41:23] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[07:46:38] <wikibugs>	 (Abandoned) Hashar: POST events asynchronously [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/816115 (owner: Hashar)
[07:52:45] <wikibugs>	 (PS1) Muehlenhoff: Add Michael Schönitzer to contributors [puppet] - https://gerrit.wikimedia.org/r/841446 (https://phabricator.wikimedia.org/T308013)
[07:55:40] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add Michael Schönitzer to contributors [puppet] - https://gerrit.wikimedia.org/r/841446 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[07:55:59] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:03:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[08:09:34] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[08:19:48] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/841122 (owner: Jbond)
[08:32:05] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good, one note inline." [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/841112 (owner: Jbond)
[08:33:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[08:35:33] <wikibugs>	 (PS1) Vgutierrez: trafficserver: Partition cache in one server per DC and cluster #3 [puppet] - https://gerrit.wikimedia.org/r/841451 (https://phabricator.wikimedia.org/T317748)
[08:37:38] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ganeti4008.ulsfo.wmnet
[08:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[08:38:51] <wikibugs>	 (CR) Muehlenhoff: "I fully trust your CSS/HTML expertise there :-) Could we capture that change in README.Debian as well, so that we are aware when we rebase" [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/841181 (owner: Jbond)
[08:41:36] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37493/console" [puppet] - https://gerrit.wikimedia.org/r/841451 (https://phabricator.wikimedia.org/T317748) (owner: Vgutierrez)
[08:48:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:52:39] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] trafficserver: Partition cache in one server per DC and cluster #3 [puppet] - https://gerrit.wikimedia.org/r/841451 (https://phabricator.wikimedia.org/T317748) (owner: Vgutierrez)
[08:53:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:53:08] <vgutierrez>	 !log partitioning the ATS cache in cp1085, cp1086, cp2037, cp2038, cp3060, cp3061, cp4026, cp4030, cp5006, cp5012, cp6005, cp6013 - T317748
[08:53:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:13] <stashbot>	 T317748: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748
[08:59:33] <icinga-wm>	 RECOVERY - puppet last run on bast1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:59:43] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] cas: drop u2f support [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/841122 (owner: Jbond)
[08:59:51] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] 6.6.1: update files to prepare for 6.6.1 release [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/841112 (owner: Jbond)
[09:04:24] <wikibugs>	 (PS3) Jbond: casLoginView.html: drop card properties [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/841181
[09:04:26] <wikibugs>	 (PS1) Jbond: build.gradle: add oidc support [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/841456
[09:05:21] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: nftables: basefirewall: typo [puppet] - https://gerrit.wikimedia.org/r/841457
[09:08:26] <wikibugs>	 (PS4) Jbond: casLoginView.html: drop card properties [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/841181
[09:09:19] <wikibugs>	 (PS2) Jbond: build.gradle: add oidc support [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/841456
[09:09:30] <wikibugs>	 (PS8) Vlad.shapik: Update the logic to run code coverage [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016)
[09:09:59] <wikibugs>	 (CR) Jbond: casLoginView.html: drop card properties (1 comment) [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/841181 (owner: Jbond)
[09:10:11] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] casLoginView.html: Add original file from cas 6.6.1 [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/841180 (owner: Jbond)
[09:11:32] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37494/console" [puppet] - https://gerrit.wikimedia.org/r/841171 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[09:19:40] <wikibugs>	 (CR) Hnowlan: [C: +1] maps: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/840139 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[09:28:25] <wikibugs>	 (CR) Jelto: [V: +1 C: +2] P:gitlab::runner: Quote environment variable hash keys [puppet] - https://gerrit.wikimedia.org/r/841171 (https://phabricator.wikimedia.org/T317997) (owner: Dduvall)
[09:32:21] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on db2110 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 62453.58 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:33:41] <hoo>	 _joe_: Do you mind having a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/841148?
[09:35:20] <wikibugs>	 (PS1) Vgutierrez: hieradata: Remove cp4031 hiera file [puppet] - https://gerrit.wikimedia.org/r/841458 (https://phabricator.wikimedia.org/T301269)
[09:36:34] <_joe_>	 hoo: yup, is the maint script updated?
[09:37:21] <hoo>	 _joe_: Not yet... but is trivial to do (we'll make it accept, but ignore the options at first)
[09:40:40] <wikibugs>	 SRE, Infrastructure-Foundations: Bug in bridge-utils breaks IPv6 on interface if its not part of a bridge but vlan sub-int of it is - https://phabricator.wikimedia.org/T320429 (aborrero) The Debian developer wanted to disable autogenerated IPv6 link local addresses on bridged interfaces.  Instead of disa...
[09:42:14] <_joe_>	 hoo: ok, so when the script doesn't error out with the parameters, ping me and we'll merge this
[09:42:30] <hoo>	 Nice, will take care of that :)
[09:44:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1006.eqiad.wmnet with reason: Remove from cluster for decom
[09:44:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1006.eqiad.wmnet with reason: Remove from cluster for decom
[09:45:39] <wikibugs>	 (03PS1) 10Ayounsi: Enable dhcp relay on ulsfo mgmt [homer/public] - 10https://gerrit.wikimedia.org/r/841460 (https://phabricator.wikimedia.org/T271583)
[09:47:09] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] hieradata: Remove cp4031 hiera file [puppet] - 10https://gerrit.wikimedia.org/r/841458 (https://phabricator.wikimedia.org/T301269) (owner: 10Vgutierrez)
[09:49:19] <icinga-wm>	 PROBLEM - MD RAID on elastic2052 is CRITICAL: CRITICAL: State: degraded, Active: 5, Working: 5, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:49:20] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on elastic2052 is CRITICAL: CRITICAL: State: degraded, Active: 5, Working: 5, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T320482 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:49:25] <wikibugs>	 10SRE, 10ops-codfw: Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 (10ops-monitoring-bot)
[09:49:31] <wikibugs>	 (03PS2) 10Ayounsi: Enable dhcp relay for mgmt network [homer/public] - 10https://gerrit.wikimedia.org/r/841460 (https://phabricator.wikimedia.org/T271583)
[09:50:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Enable dhcp relay for mgmt network [homer/public] - 10https://gerrit.wikimedia.org/r/841460 (https://phabricator.wikimedia.org/T271583) (owner: 10Ayounsi)
[09:53:17] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove ganeti1006 [puppet] - 10https://gerrit.wikimedia.org/r/841461 (https://phabricator.wikimedia.org/T320419)
[09:53:49] <wikibugs>	 (03PS1) 10Kosta Harlan: Revert "Skins: Config flag controls contributions link" [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841160 (https://phabricator.wikimedia.org/T320471)
[09:55:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:57:08] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.provision for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED
[09:57:22] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Default allowed SSH parameters on upgraded Juniper mgmt routers prevent some connections - https://phabricator.wikimedia.org/T320272 (10jbond) > , as we can drop to a regular shell and specify the MAC code manually: FYi you can also use the .ssh/config file whic...
[10:00:44] <wikibugs>	 (03PS1) 10JMeybohm: kubernetes::master calculate apiserver_count [puppet] - 10https://gerrit.wikimedia.org/r/841463 (https://phabricator.wikimedia.org/T307943)
[10:00:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:02:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove ganeti1006 [puppet] - 10https://gerrit.wikimedia.org/r/841461 (https://phabricator.wikimedia.org/T320419) (owner: 10Muehlenhoff)
[10:02:58] <logmsgbot>	 !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED
[10:06:02] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.provision for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED
[10:07:39] <logmsgbot>	 !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED
[10:08:18] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.provision for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED
[10:11:18] <wikibugs>	 (03PS2) 10JMeybohm: kubernetes::master remove apiserver_count [puppet] - 10https://gerrit.wikimedia.org/r/841463 (https://phabricator.wikimedia.org/T307943)
[10:12:03] <logmsgbot>	 !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED
[10:12:46] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37496/console" [puppet] - 10https://gerrit.wikimedia.org/r/841463 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm)
[10:13:18] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.provision for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED
[10:22:08] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277 (10jbond) I thought i would bring my response here.  > Setting skip_acked will also skip recheck_failed_services() Regardless of if we call `...
[10:26:19] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "With this patch we won't just disable pregeneration, but also disable cache invalidation. I would assume we'd need to just switch the meth" [deployment-charts] - 10https://gerrit.wikimedia.org/r/840097 (https://phabricator.wikimedia.org/T319365) (owner: 10Jgiannelos)
[10:27:14] <wikibugs>	 10SRE-swift-storage, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-Phonos, 10Community-Tech (CommTech-Sprint-35), and 2 others: Phonos links to an unauthorized URL - https://phabricator.wikimedia.org/T317417 (10JMcLeod_WMF)
[10:29:22] <wikibugs>	 (03PS1) 10Muehlenhoff: sre.ganeti.changedisk: Correct RAPI call [cookbooks] - 10https://gerrit.wikimedia.org/r/841464
[10:32:56] <wikibugs>	 10SRE, 10Security-Team, 10LDAP: Audit the WMF LDAP group and limit its permissions - https://phabricator.wikimedia.org/T240870 (10Peachey88)
[10:33:07] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] Run helm dependency build before packaging [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/826859 (https://phabricator.wikimedia.org/T316347) (owner: 10JMeybohm)
[10:35:54] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277 (10jbond) >>! In T319277#8307039, @jbond wrote: > I thought i would bring my response here. >  >> Setting skip_acked will also skip recheck_f...
[10:39:06] <wikibugs>	 (03CR) 10Jbond: "removing -1 change seems fine to me however im not convinced this is the correct way to go, but have moved that comment to the task" [software/spicerack] - 10https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277) (owner: 10Slyngshede)
[10:39:44] <wikibugs>	 10SRE, 10Sustainability (Incident Followup): Expand upon Kask/Sessionstore documentation - https://phabricator.wikimedia.org/T320398 (10hnowlan) Dashboard for all of the relevant metrics to [[ https://wikitech.wikimedia.org/wiki/Incidents/2022-09-15_sessionstore_quorum_issues | the incident ]] that triggered t...
[10:40:57] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] Revert "Skins: Config flag controls contributions link" [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841160 (https://phabricator.wikimedia.org/T320471) (owner: 10Kosta Harlan)
[10:41:10] <logmsgbot>	 !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED
[10:42:16] <wikibugs>	 (03PS2) 10Zabe: Branch commit for wmf/1.40.0-wmf.5 [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/840575 (https://phabricator.wikimedia.org/T314194) (owner: 10TrainBranchBot)
[10:42:26] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/841145 (owner: 10Arturo Borrero Gonzalez)
[10:43:27] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] Branch commit for wmf/1.40.0-wmf.5 [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/840575 (https://phabricator.wikimedia.org/T314194) (owner: 10TrainBranchBot)
[10:44:02] <wikibugs>	 (03PS1) 10Muehlenhoff: Make ganeti1032 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/841467
[10:44:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Make ganeti1032 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/841467 (owner: 10Muehlenhoff)
[10:48:16] <wikibugs>	 (03PS2) 10Muehlenhoff: Make ganeti1032 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/841467 (https://phabricator.wikimedia.org/T299459)
[10:53:29] <icinga-wm>	 PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 302 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[10:58:18] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Make ganeti1032 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/841467 (https://phabricator.wikimedia.org/T299459) (owner: 10Muehlenhoff)
[11:02:04] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudnet: merge host hiera overrides back into the profile [puppet] - 10https://gerrit.wikimedia.org/r/841145 (owner: 10Arturo Borrero Gonzalez)
[11:03:29] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Skins: Config flag controls contributions link" [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841160 (https://phabricator.wikimedia.org/T320471) (owner: 10Kosta Harlan)
[11:03:49] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/1.40.0-wmf.5 [core] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/840575 (https://phabricator.wikimedia.org/T314194) (owner: 10TrainBranchBot)
[11:05:49] <icinga-wm>	 RECOVERY - mailman3_queue_size on lists1001 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[11:10:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1032.eqiad.wmnet
[11:10:55] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: Revert "cloudnet: merge host hiera overrides back into the profile" [puppet] - 10https://gerrit.wikimedia.org/r/841161
[11:11:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[11:11:41] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Revert "cloudnet: merge host hiera overrides back into the profile" [puppet] - 10https://gerrit.wikimedia.org/r/841161 (owner: 10Arturo Borrero Gonzalez)
[11:12:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[11:12:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[11:12:47] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10jbond)
[11:13:28] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] nftables: basefirewall: typo [puppet] - 10https://gerrit.wikimedia.org/r/841457 (owner: 10Arturo Borrero Gonzalez)
[11:13:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[11:15:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:19:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1032.eqiad.wmnet
[11:19:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T314041)', diff saved to https://phabricator.wikimedia.org/P35394 and previous config saved to /var/cache/conftool/dbconfig/20221011-111954-ladsgroup.json
[11:19:59] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[11:20:51] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277 (10Volans) I understand your concerns  > Regardless of if we call wait_for_optimal(True) or wait_for_optimal(False) we should always call rec...
[11:20:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[11:20:57] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s4 on db2110 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:21:20] <wikibugs>	 (03CR) 10Volans: "Adding a couple of more comments, but let's see what we agree on in the task first." [software/spicerack] - 10https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277) (owner: 10Slyngshede)
[11:21:31] <wikibugs>	 (03PS9) 10Vlad.shapik: Update the logic to run code coverage [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016)
[11:23:14] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/841180 (owner: 10Jbond)
[11:26:05] <wikibugs>	 (03Abandoned) 10Hokwelum: Add labstore1006 to dumps distribution servers [puppet] - 10https://gerrit.wikimedia.org/r/840158 (https://phabricator.wikimedia.org/T319269) (owner: 10Hokwelum)
[11:26:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1032.eqiad.wmnet to cluster eqiad and group A
[11:26:42] <wikibugs>	 (03PS10) 10Vlad.shapik: Update the logic to run code coverage [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016)
[11:27:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1032.eqiad.wmnet to cluster eqiad and group A
[11:28:13] <wikibugs>	 (03PS11) 10Vlad.shapik: Update the logic to run code coverage [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016)
[11:29:25] <wikibugs>	 10SRE-swift-storage, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-Phonos, 10Community-Tech (CommTech-Sprint-35), and 2 others: Phonos links to an unauthorized URL - https://phabricator.wikimedia.org/T317417 (10TheresNoTime) @MatthewVernon continuing from T316845, (and I know I'm pushing my luck he...
[11:35:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P35395 and previous config saved to /var/cache/conftool/dbconfig/20221011-113501-ladsgroup.json
[11:36:36] <wikibugs>	 (03PS1) 10Ladsgroup: Add drop_fr_comment_fr_text_T318955.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/841475 (https://phabricator.wikimedia.org/T318955)
[11:37:43] <wikibugs>	 (03CR) 10Vlad.shapik: Update the logic to run code coverage (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016) (owner: 10Vlad.shapik)
[11:37:57] <wikibugs>	 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff)
[11:46:37] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10cmooney) Thanks for tracking all this John.  As you know most of our hosts just have a single interface with single unica...
[11:47:21] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[11:47:34] <wikibugs>	 10SRE-swift-storage, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-Phonos, 10Community-Tech (CommTech-Sprint-35), and 2 others: Phonos links to an unauthorized URL - https://phabricator.wikimedia.org/T317417 (10MatthewVernon) So that `render` is coming from the `zone` setting in your `rewrite.py` c...
[11:47:38] <wikibugs>	 (03PS1) 10Hnowlan: haproxy: fix apt repository path [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/841477 (https://phabricator.wikimedia.org/T233196)
[11:48:55] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[11:49:32] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudnet: merge host hiera overrides back into the profile [puppet] - 10https://gerrit.wikimedia.org/r/841478
[11:49:58] <wikibugs>	 (03PS1) 10Jbond: wmflib: add new functions to update a hash with randome secrets [puppet] - 10https://gerrit.wikimedia.org/r/841479
[11:50:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P35396 and previous config saved to /var/cache/conftool/dbconfig/20221011-115007-ladsgroup.json
[11:50:45] <wikibugs>	 (03PS11) 10Hnowlan: thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196)
[11:52:01] <wikibugs>	 (03PS2) 10Jbond: wmflib: add new functions to update a hash with randome secrets [puppet] - 10https://gerrit.wikimedia.org/r/841479
[11:53:03] <Lucas_WMDE>	 someone™ still needs to pull and sync this wmf.5 backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/841160
[11:53:12] <Lucas_WMDE>	 I can do it later but not right now
[11:53:22] <Lucas_WMDE>	 if anyone else wants to do it in the meantime :)
[11:54:37] <wikibugs>	 (03CR) 10Hnowlan: thumbor: new service chart (0333 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[11:56:46] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Default allowed SSH parameters on upgraded Juniper mgmt routers prevent some connections - https://phabricator.wikimedia.org/T320272 (10cmooney) > AFAIK this configures the ssh daemon to accept connections using this protocol (possibly also configures outbound c...
[12:00:42] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] admin: add thumbor namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/824473 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[12:01:56] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloudnet: merge host hiera overrides back into the profile [puppet] - 10https://gerrit.wikimedia.org/r/841478 (https://phabricator.wikimedia.org/T316284)
[12:02:42] <wikibugs>	 10SRE-swift-storage, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-Phonos, 10Community-Tech (CommTech-Sprint-35), and 2 others: Phonos links to an unauthorized URL - https://phabricator.wikimedia.org/T317417 (10MatthewVernon) I //think// `global-data-phonos-render` is likely the correct location (p...
[12:02:53] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10jbond) >>! In T234207#8307389, @cmooney wrote: > I'm not sure if this task is the best place to discuss this but I'm of t...
[12:05:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T314041)', diff saved to https://phabricator.wikimedia.org/P35397 and previous config saved to /var/cache/conftool/dbconfig/20221011-120514-ladsgroup.json
[12:05:19] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[12:05:55] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/pcc-worker1003/37498/" [puppet] - 10https://gerrit.wikimedia.org/r/841478 (https://phabricator.wikimedia.org/T316284) (owner: 10Arturo Borrero Gonzalez)
[12:08:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:09:34] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[12:09:54] <wikibugs>	 (03PS12) 10Hnowlan: thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196)
[12:10:52] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10MoritzMuehlenhoff) >>! In T234207#8307389, @cmooney wrote: > Thanks for tracking all this John. >  > So for instance we c...
[12:13:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:13:08] <wikibugs>	 (03CR) 10Hnowlan: admin: add thumbor namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/824473 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan)
[12:13:56] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10ayounsi)
[12:16:23] <icinga-wm>	 RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:22:46] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10cmooney) >>! In T234207#8307423, @jbond wrote: > Perhaps from the netbox PoV but from any new (networkd) module should su...
[12:30:23] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10jbond) >>! In T234207#8307431, @MoritzMuehlenhoff wrote: >>>! In T234207#8307389, @cmooney wrote: >> Thanks for tracking...
[12:32:00] <wikibugs>	 (03PS1) 10Hoo man: updateQueryServiceLag: Add lb(-pool) options for forward compatibility [extensions/Wikidata.org] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/841164 (https://phabricator.wikimedia.org/T315423)
[12:32:30] <wikibugs>	 (03PS1) 10Hoo man: updateQueryServiceLag: Add lb(-pool) options for forward compatibility [extensions/Wikidata.org] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841165 (https://phabricator.wikimedia.org/T315423)
[12:32:44] <wikibugs>	 (03PS1) 10Vgutierrez: trafficserver: Partition cache in one server per DC and cluster #4 [puppet] - 10https://gerrit.wikimedia.org/r/841486 (https://phabricator.wikimedia.org/T317748)
[12:33:14] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, optional improvement inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/841464 (owner: 10Muehlenhoff)
[12:33:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[12:35:20] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/841478 (https://phabricator.wikimedia.org/T316284) (owner: 10Arturo Borrero Gonzalez)
[12:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[12:38:55] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37499/console" [puppet] - 10https://gerrit.wikimedia.org/r/841486 (https://phabricator.wikimedia.org/T317748) (owner: 10Vgutierrez)
[12:39:23] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.provision for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED
[12:42:50] <Lucas_WMDE>	 jouncebot: now
[12:42:51] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 17 minute(s)
[12:42:55] <wikibugs>	 (03PS1) 10Jgiannelos: mobileapps: Bump to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/841489
[12:43:13] <Lucas_WMDE>	 alright, I would pull and sync the wmf.5 backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/841160 now
[12:43:35] <Lucas_WMDE>	 it shouldn’t have any effect yet, but my understanding is that, since wmf.5 already exists on the servers, this should also be synced after being merged into the branch
[12:44:26] <Lucas_WMDE>	 though I’m not 100% sure about that, because the “Branch commit for wmf/1.40.0-wmf.5” (https://gerrit.wikimedia.org/r/c/mediawiki/core/+/840575) apparently hasn’t been pulled+synced yet either
[12:45:28] <Lucas_WMDE>	 hm, actually, extensions/ and skins/ are empty apart from a README so far
[12:45:44] <Lucas_WMDE>	 so I’d have to do a sync-world, probably
[12:46:01] <Lucas_WMDE>	 I think I’ll leave that for the train deployers, then :)
[12:46:35] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Partition cache in one server per DC and cluster #4 [puppet] - 10https://gerrit.wikimedia.org/r/841486 (https://phabricator.wikimedia.org/T317748) (owner: 10Vgutierrez)
[12:46:53] <vgutierrez>	 !log partitioning the ATS cache in cp[2035-2036], cp[6004,6012], cp[1083-1084], cp[5005,5011], cp[3058-3059], cp[4025,4029] - T317748
[12:46:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:58] <stashbot>	 T317748: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748
[12:50:09] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277 (10SLyngshede-WMF) 05Open→03In progress
[12:50:57] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Bump to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/841489 (owner: 10Jgiannelos)
[12:51:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST jobs) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:52:05] <wikibugs>	 (03PS6) 10Samtar: swift: Add deployment-prep_hosts.yaml [puppet] - 10https://gerrit.wikimedia.org/r/836953 (https://phabricator.wikimedia.org/T316845)
[12:53:30] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10SLyngshede-WMF) a:03SLyngshede-WMF
[12:53:59] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10SLyngshede-WMF) Note to myself: Check if this is still an issue, and if yes, are we still working on it.
[12:55:00] <wikibugs>	 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10puppet-compiler, and 2 others: Improve PCC support for cloud VPS environments - https://phabricator.wikimedia.org/T289666 (10SLyngshede-WMF) a:03SLyngshede-WMF
[12:55:13] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: Bump to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/841489 (owner: 10Jgiannelos)
[12:55:38] <hoo>	 FYI: I will come a bit later for SWAT (but will do my patches on my own)
[12:55:42] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4] - https://phabricator.wikimedia.org/T290984 (10SLyngshede-WMF) a:03SLyngshede-WMF
[12:56:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST jobs) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:57:20] <wikibugs>	 (03PS1) 10Kosta Harlan: AddContributeCardEntryPoint: Use RequestContext::getMain [extensions/ContentTranslation] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841509 (https://phabricator.wikimedia.org/T319327)
[12:57:39] <wikibugs>	 10SRE-swift-storage, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-Phonos, 10Community-Tech (CommTech-Sprint-35), and 2 others: Phonos links to an unauthorized URL - https://phabricator.wikimedia.org/T317417 (10TheresNoTime) With the utmost thanks to @MatthewVernon and everyone else who has comment...
[12:57:43] <wikibugs>	 (03PS2) 10Muehlenhoff: sre.ganeti.changedisk: Correct RAPI call [cookbooks] - 10https://gerrit.wikimedia.org/r/841464
[12:57:47] <wikibugs>	 (03CR) 10Muehlenhoff: sre.ganeti.changedisk: Correct RAPI call (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/841464 (owner: 10Muehlenhoff)
[12:58:01] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:58:40] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: sslcert::x509_to_pkcs12 fails to overwrite a valid output file when its contents should change - https://phabricator.wikimedia.org/T287869 (10SLyngshede-WMF) a:03SLyngshede-WMF
[12:58:51] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[12:59:20] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[12:59:22] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) 05In progress→03Resolved This should be resolved now; I'll tentatively close it. Thanks for the ping, and please re-open if there are sti...
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: My dear minions, it's time we take the moon! Just kidding. Time for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221011T1300).
[13:00:05] <jouncebot>	 stephanebisson and hoo: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:05] <jouncebot>	 Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221011T1300)
[13:00:08] <Lucas_WMDE>	 o/
[13:00:19] <Lucas_WMDE>	 I can deploy
[13:00:33] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[13:00:34] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10cmooney) > One thing I forgot to highlight is that there is currently a bit of a chicken/egg issue of using interface_auto...
[13:00:56] <stephanebisson>	 Hello
[13:01:34] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[13:01:34] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Make discovery mode config default to 'off' [extensions/Wikistories] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/840178 (https://phabricator.wikimedia.org/T314582) (owner: 10Sbisson)
[13:01:39] <Lucas_WMDE>	 hi stephanebisson 
[13:01:55] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/841464 (owner: 10Muehlenhoff)
[13:01:56] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[13:02:25] <wikibugs>	 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10puppet-compiler, and 2 others: Improve PCC support for cloud VPS environments - https://phabricator.wikimedia.org/T289666 (10jbond) > add an option to test on one random (or possibly hardcoded) host from both the cloud and wmcs environments This is sti...
[13:02:35] <wikibugs>	 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10puppet-compiler, and 2 others: Improve PCC support for cloud VPS environments - https://phabricator.wikimedia.org/T289666 (10jbond)
[13:02:45] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[13:02:52] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] sre.ganeti.changedisk: Correct RAPI call [cookbooks] - 10https://gerrit.wikimedia.org/r/841464 (owner: 10Muehlenhoff)
[13:03:43] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff)
[13:03:54] <wikibugs>	 (03Merged) 10jenkins-bot: Make discovery mode config default to 'off' [extensions/Wikistories] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/840178 (https://phabricator.wikimedia.org/T314582) (owner: 10Sbisson)
[13:05:20] <Lucas_WMDE>	 fetching
[13:06:04] <Lucas_WMDE>	 stephanebisson: the change should be on mwdebug1001, can you test it?
[13:06:17] <stephanebisson>	 Lucas_WMDE, yes, on it
[13:06:22] <Lucas_WMDE>	 thanks
[13:07:27] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[13:07:41] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[13:08:28] <stephanebisson>	 Lucas_WMDE Looks good
[13:08:58] <Lucas_WMDE>	 ok
[13:10:03] <wikibugs>	 (03PS1) 10JMeybohm: Randomize tokens in profile::kubernetes::infrastructure_users [labs/private] - 10https://gerrit.wikimedia.org/r/841494
[13:10:05] <Lucas_WMDE>	 nemo-yiannis: (assuming you’re jgiannelos) are you looking into the mobileapps alert that icinga just posted above?
[13:10:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:10:16] <nemo-yiannis>	 yeah, i am going to revert 
[13:10:21] <Lucas_WMDE>	 ok, just checking :)
[13:10:27] <Lucas_WMDE>	 I’ll proceed with the backport then
[13:10:42] * Lucas_WMDE forgot to use scap backport again
[13:11:13] <Lucas_WMDE>	 syncing
[13:11:33] <wikibugs>	 (03PS1) 10JMeybohm: kubernetes::master fail if user tokens are not unique [puppet] - 10https://gerrit.wikimedia.org/r/841495
[13:11:50] <wikibugs>	 (03PS1) 10Jgiannelos: Revert "mobileapps: Bump to latest version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/841510
[13:12:08] <wikibugs>	 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10Observability-Metrics, 10Patch-For-Review: Implement Prometheus exporter for Ganeti capacity data - https://phabricator.wikimedia.org/T311288 (10SLyngshede-WMF) 05In progress→03Resolved
[13:12:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:12:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:12:47] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] Revert "mobileapps: Bump to latest version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/841510 (owner: 10Jgiannelos)
[13:13:45] <logmsgbot>	 !log volans@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED
[13:13:55] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] kubernetes::master fail if user tokens are not unique [puppet] - 10https://gerrit.wikimedia.org/r/841495 (owner: 10JMeybohm)
[13:14:14] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.hosts.provision for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED
[13:14:42] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277 (10jbond) > The original idea was that we don't want to ignore ack'ed alerts blindly I'm not sure this was the original idea going from the ta...
[13:14:58] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.40.0-wmf.4/extensions/Wikistories/extension.json: Backport: [[gerrit:840178|Make discovery mode config default to 'off' (T314582)]] (duration: 03m 48s)
[13:15:03] <stashbot>	 T314582: Make Wikistories configurable for public release - https://phabricator.wikimedia.org/T314582
[13:15:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:15:29] <Lucas_WMDE>	 ok, that’s done
[13:15:39] <Lucas_WMDE>	 hoo will self-service later
[13:16:00] <stephanebisson>	 Lucas_WMDE thank you!
[13:16:04] <wikibugs>	 (03PS2) 10JMeybohm: kubernetes::master fail if user tokens are not unique [puppet] - 10https://gerrit.wikimedia.org/r/841495
[13:16:06] <Lucas_WMDE>	 np :)
[13:16:37] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mobileapps: Bump to latest version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/841510 (owner: 10Jgiannelos)
[13:17:13] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[13:17:29] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[13:17:34] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[13:18:15] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[13:18:23] <wikibugs>	 (03CR) 10jenkins-bot: kubernetes::master fail if user tokens are not unique [puppet] - 10https://gerrit.wikimedia.org/r/841495 (owner: 10JMeybohm)
[13:18:35] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[13:18:45] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[13:19:16] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[13:19:39] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10Jclark-ctr) Verified Netbox  Thanks
[13:19:47] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10Jclark-ctr) 05Open→03Resolved
[13:19:55] <wikibugs>	 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations, 10netops, and 2 others: asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10Jclark-ctr)
[13:20:40] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/841495 (owner: 10JMeybohm)
[13:20:41] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[13:21:36] <nemo-yiannis>	 Ok, it looks like mobileapps is not complaining any more; I will push a fix and re-deploy
[13:21:48] <wikibugs>	 (03PS3) 10Ssingh: P:base: configure Linux 5.10 on buster via Hiera [puppet] - 10https://gerrit.wikimedia.org/r/840162 (https://phabricator.wikimedia.org/T319067)
[13:22:48] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37500/console" [puppet] - 10https://gerrit.wikimedia.org/r/840162 (https://phabricator.wikimedia.org/T319067) (owner: 10Ssingh)
[13:23:18] <wikibugs>	 (03PS6) 10Giuseppe Lavagetto: Stub of the new organization of templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495
[13:23:30] <wikibugs>	 (03PS3) 10Jbond: kubernetes::master fail if user tokens are not unique [puppet] - 10https://gerrit.wikimedia.org/r/841495 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm)
[13:23:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Stub of the new organization of templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495 (owner: 10Giuseppe Lavagetto)
[13:25:31] <wikibugs>	 (03PS4) 10JMeybohm: kubernetes::master fail if user tokens are not unique [puppet] - 10https://gerrit.wikimedia.org/r/841495 (https://phabricator.wikimedia.org/T307943)
[13:26:56] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Randomize tokens in profile::kubernetes::infrastructure_users [labs/private] - 10https://gerrit.wikimedia.org/r/841494 (owner: 10JMeybohm)
[13:30:41] <wikibugs>	 (03PS1) 10Ssingh: dns4004: add Puppet role and DNS/NTP configs [puppet] - 10https://gerrit.wikimedia.org/r/841496 (https://phabricator.wikimedia.org/T317247)
[13:33:27] <wikibugs>	 (03PS1) 10BBlack: Temporary rate exemption for IABot source IPs [puppet] - 10https://gerrit.wikimedia.org/r/841499 (https://phabricator.wikimedia.org/T318065)
[13:33:43] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: sslcert::x509_to_pkcs12 fails to overwrite a valid output file when its contents should change - https://phabricator.wikimedia.org/T287869 (10jbond) I'm not sure if I ever looked at this task; however, I do notice that I have an old closed PR for stdlib which seems related...
[13:34:34] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] Update the logic to run code coverage [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016) (owner: 10Vlad.shapik)
[13:35:15] <wikibugs>	 (03PS1) 10Ssingh: hiera: decommission dns4002 [puppet] - 10https://gerrit.wikimedia.org/r/841500 (https://phabricator.wikimedia.org/T320440)
[13:37:11] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: decommission dns4002 [puppet] - 10https://gerrit.wikimedia.org/r/841500 (https://phabricator.wikimedia.org/T320440) (owner: 10Ssingh)
[13:37:21] <wikibugs>	 (03PS3) 10BBlack: Add wikifunctions to Varnish as a 302 [puppet] - 10https://gerrit.wikimedia.org/r/822657 (https://phabricator.wikimedia.org/T275904)
[13:37:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by hoo@deploy1002 using scap backport" [extensions/Wikidata.org] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/841164 (https://phabricator.wikimedia.org/T315423) (owner: 10Hoo man)
[13:40:41] <wikibugs>	 (03PS3) 10Ayounsi: Management routers: replace bootp with dhcp-relay [homer/public] - 10https://gerrit.wikimedia.org/r/841460 (https://phabricator.wikimedia.org/T271583)
[13:41:07] <wikibugs>	 (03PS1) 10Ssingh: sites.yaml: decom dns4002 [homer/public] - 10https://gerrit.wikimedia.org/r/841501 (https://phabricator.wikimedia.org/T320440)
[13:41:33] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] Temporary rate exemption for IABot source IPs [puppet] - 10https://gerrit.wikimedia.org/r/841499 (https://phabricator.wikimedia.org/T318065) (owner: 10BBlack)
[13:41:48] <wikibugs>	 (03PS1) 10Jgiannelos: mobileapps: Move hardcoded values to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/841502
[13:42:57] <wikibugs>	 (03CR) 10Jgiannelos: "This is related to the production errors triggered from this deployment:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/841502 (owner: 10Jgiannelos)
[13:43:05] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good! I'll rebase https://gerrit.wikimedia.org/r/c/operations/puppet/+/841134 after you've merged" [puppet] - 10https://gerrit.wikimedia.org/r/840162 (https://phabricator.wikimedia.org/T319067) (owner: 10Ssingh)
[13:44:03] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:base: configure Linux 5.10 on buster via Hiera [puppet] - 10https://gerrit.wikimedia.org/r/840162 (https://phabricator.wikimedia.org/T319067) (owner: 10Ssingh)
[13:44:18] <wikibugs>	 (03CR) 10Jgiannelos: "This is related to this mobileapps change:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/841502 (owner: 10Jgiannelos)
[13:44:20] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] Add wikifunctions to Varnish as a 302 [puppet] - 10https://gerrit.wikimedia.org/r/822657 (https://phabricator.wikimedia.org/T275904) (owner: 10BBlack)
[13:44:29] <wikibugs>	 (03PS4) 10BBlack: Add wikifunctions to Varnish as a 302 [puppet] - 10https://gerrit.wikimedia.org/r/822657 (https://phabricator.wikimedia.org/T275904)
[13:47:23] <wikibugs>	 (03Merged) 10jenkins-bot: Update the logic to run code coverage [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/833426 (https://phabricator.wikimedia.org/T313016) (owner: 10Vlad.shapik)
[13:49:14] <wikibugs>	 (03PS2) 10Ssingh: Switch profile::base::linux510 to the new meta package [puppet] - 10https://gerrit.wikimedia.org/r/841134 (https://phabricator.wikimedia.org/T319067) (owner: 10Muehlenhoff)
[13:49:16] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Manuel - https://phabricator.wikimedia.org/T320504 (10Manuel)
[13:50:14] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs4008.mgmt.ulsfo.wmnet with reboot policy FORCED
[13:50:47] <icinga-wm>	 RECOVERY - SSH on restbase1028 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:50:55] <icinga-wm>	 PROBLEM - cassandra-c service on restbase1028 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:51:27] <icinga-wm>	 PROBLEM - cassandra-b service on restbase1028 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:51:33] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.0.209:7001 on restbase1028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[13:51:33] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.0.210:7001 on restbase1028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[13:51:33] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.0.211:7001 on restbase1028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[13:51:39] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1028 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:51:55] <icinga-wm>	 RECOVERY - Restbase root url on restbase1028 is OK: HTTP OK: HTTP/1.1 200 - 17317 bytes in 0.018 second response time https://wikitech.wikimedia.org/wiki/RESTBase
[13:52:13] <icinga-wm>	 PROBLEM - puppet last run on restbase1028 is CRITICAL: CRITICAL: Puppet last ran 2 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:52:18] <wikibugs>	 (03PS1) 10Elukey: istio: reduce Envoy logspam [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/841527 (https://phabricator.wikimedia.org/T320468)
[13:53:00] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/841460 (https://phabricator.wikimedia.org/T271583) (owner: 10Ayounsi)
[13:53:11] <icinga-wm>	 RECOVERY - cassandra-c service on restbase1028 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:53:41] <icinga-wm>	 RECOVERY - cassandra-b service on restbase1028 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:53:53] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1028 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:54:00] <wikibugs>	 (03PS2) 10Jgiannelos: mobileapps: Move hardcoded values to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/841502
[13:54:05] <wikibugs>	 (03PS2) 10Elukey: istio: reduce Envoy logspam [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/841527 (https://phabricator.wikimedia.org/T320468)
[13:54:45] <wikibugs>	 (03PS3) 10Jgiannelos: mobileapps: Move hardcoded values to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/841502 (https://phabricator.wikimedia.org/T320505)
[13:55:09] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.0.209:9042 on restbase1028 is OK: TCP OK - 0.000 second response time on 10.64.0.209 port 9042 https://phabricator.wikimedia.org/T93886
[13:56:01] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.0.209:7001 on restbase1028 is OK: SSL OK - Certificate restbase1028-a valid until 2024-08-30 21:25:17 +0000 (expires in 689 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[13:56:01] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.0.210:7001 on restbase1028 is OK: SSL OK - Certificate restbase1028-b valid until 2024-08-30 21:25:20 +0000 (expires in 689 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[13:56:01] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.64.0.211:7001 on restbase1028 is OK: SSL OK - Certificate restbase1028-c valid until 2024-08-30 21:25:22 +0000 (expires in 689 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[13:56:15] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.64.0.210:9042 on restbase1028 is OK: TCP OK - 0.000 second response time on 10.64.0.210 port 9042 https://phabricator.wikimedia.org/T93886
[13:56:33] <wikibugs>	 (03Merged) 10jenkins-bot: updateQueryServiceLag: Add lb(-pool) options for forward compatibility [extensions/Wikidata.org] (wmf/1.40.0-wmf.4) - 10https://gerrit.wikimedia.org/r/841164 (https://phabricator.wikimedia.org/T315423) (owner: 10Hoo man)
[13:56:37] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.64.0.211:9042 on restbase1028 is OK: TCP OK - 0.000 second response time on 10.64.0.211 port 9042 https://phabricator.wikimedia.org/T93886
[13:56:47] <logmsgbot>	 !log hoo@deploy1002 Started scap: Backport for [[gerrit:841164|updateQueryServiceLag: Add lb(-pool) options for forward compatibility (T315423 T238751)]]
[13:56:53] <stashbot>	 T238751: Only generate maxlag from pooled query service servers. - https://phabricator.wikimedia.org/T238751
[13:56:53] <stashbot>	 T315423: Revive and merge patch to update maxlag calculation - https://phabricator.wikimedia.org/T315423
[13:57:07] <logmsgbot>	 !log hoo@deploy1002 hoo and hoo: Backport for [[gerrit:841164|updateQueryServiceLag: Add lb(-pool) options for forward compatibility (T315423 T238751)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[13:58:12] <wikibugs>	 (03PS7) 10Giuseppe Lavagetto: Stub of the new organization of templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495
[13:58:27] <icinga-wm>	 RECOVERY - puppet last run on restbase1028 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:58:35] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Stub of the new organization of templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495 (owner: 10Giuseppe Lavagetto)
[13:59:29] <wikibugs>	 10SRE, 10ops-ulsfo, 10Traffic, 10decommission-hardware, 10Patch-For-Review: decommission dns4002 - https://phabricator.wikimedia.org/T320440 (10ssingh)
[14:00:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:01:45] <logmsgbot>	 !log hoo@deploy1002 Finished scap: Backport for [[gerrit:841164|updateQueryServiceLag: Add lb(-pool) options for forward compatibility (T315423 T238751)]] (duration: 04m 57s)
[14:02:38] <wikibugs>	 (03CR) 10Hoo man: [C: 03+2] "Branch is not deployed yet" [extensions/Wikidata.org] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841165 (https://phabricator.wikimedia.org/T315423) (owner: 10Hoo man)
[14:02:48] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] openstack: keystone: enable app credentials everywhere [puppet] - 10https://gerrit.wikimedia.org/r/840121 (https://phabricator.wikimedia.org/T294195) (owner: 10Majavah)
[14:03:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:03:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:05:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:06:20] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/841460 (https://phabricator.wikimedia.org/T271583) (owner: 10Ayounsi)
[14:08:36] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "Bump the chart version too, otherwise this isn't going to be deployable." [deployment-charts] - 10https://gerrit.wikimedia.org/r/841502 (https://phabricator.wikimedia.org/T320505) (owner: 10Jgiannelos)
[14:10:13] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] sites.yaml: decom dns4002 [homer/public] - 10https://gerrit.wikimedia.org/r/841501 (https://phabricator.wikimedia.org/T320440) (owner: 10Ssingh)
[14:10:44] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Management routers: replace bootp with dhcp-relay [homer/public] - 10https://gerrit.wikimedia.org/r/841460 (https://phabricator.wikimedia.org/T271583) (owner: 10Ayounsi)
[14:11:01] <wikibugs>	 (03PS3) 10Andrew Bogott: P:mariadb::cloudinfra: remove direct access from puppetmaster hosts [puppet] - 10https://gerrit.wikimedia.org/r/831041 (owner: 10Majavah)
[14:11:36] <wikibugs>	 (03Merged) 10jenkins-bot: Management routers: replace bootp with dhcp-relay [homer/public] - 10https://gerrit.wikimedia.org/r/841460 (https://phabricator.wikimedia.org/T271583) (owner: 10Ayounsi)
[14:12:39] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] sites.yaml: decom dns4002 [homer/public] - 10https://gerrit.wikimedia.org/r/841501 (https://phabricator.wikimedia.org/T320440) (owner: 10Ssingh)
[14:14:25] <sukhe>	 !log homer "cr*-ulsfo*" commit "Gerrit 841501: sites.yaml: decom dns4002"
[14:14:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:18] <sukhe>	 !log completed homer run for Gerrit 841501
[14:15:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:23] <icinga-wm>	 RECOVERY - BFD status on cr3-ulsfo is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:16:21] <icinga-wm>	 RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 82, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:16:23] <icinga-wm>	 RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 105, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:16:31] <sukhe>	 ^ stems from Gerrit 841501
[14:17:13] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:17:16] <wikibugs>	 (03PS2) 10Muehlenhoff: Make ganeti4008 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/841390 (https://phabricator.wikimedia.org/T317247)
[14:19:29] <wikibugs>	 (03Merged) 10jenkins-bot: updateQueryServiceLag: Add lb(-pool) options for forward compatibility [extensions/Wikidata.org] (wmf/1.40.0-wmf.5) - 10https://gerrit.wikimedia.org/r/841165 (https://phabricator.wikimedia.org/T315423) (owner: 10Hoo man)
[14:19:59] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Make ganeti4008 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/841390 (https://phabricator.wikimedia.org/T317247) (owner: 10Muehlenhoff)
[14:20:48] <wikibugs>	 (03PS1) 10Ssingh: sites.yaml: add dns4004 to anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/841533 (https://phabricator.wikimedia.org/T317247)
[14:22:18] <wikibugs>	 (03PS9) 10Andrew Bogott: alerts.downtime_host: attempt to match alert hostnames with :<port> [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132
[14:22:20] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] P:mariadb::cloudinfra: remove direct access from puppetmaster hosts [puppet] - 10https://gerrit.wikimedia.org/r/831041 (owner: 10Majavah)
[14:22:45] <wikibugs>	 (03PS2) 10Ssingh: dns4004: add Puppet role and DNS/NTP configs [puppet] - 10https://gerrit.wikimedia.org/r/841496 (https://phabricator.wikimedia.org/T317247)
[14:23:18] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10ayounsi) For physical servers we indeed need to keep the whole lifecycle/provisioning process in mind (racking/provisioni...
[14:24:57] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mobileapps: Move hardcoded values to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/841502 (https://phabricator.wikimedia.org/T320505) (owner: 10Jgiannelos)
[14:25:20] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch profile::base::linux510 to the new meta package [puppet] - 10https://gerrit.wikimedia.org/r/841536 (https://phabricator.wikimedia.org/T319067)
[14:25:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:26:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet
[14:26:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:26:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:26:41] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] sites.yaml: add dns4004 to anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/841533 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh)
[14:27:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:29:10] <wikibugs>	 10SRE, 10Discovery-Search, 10serviceops, 10serviceops-collab, and 2 others: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10LSobanski) a:03Dzahn
[14:30:21] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for hshaikh and ptiwary - https://phabricator.wikimedia.org/T319326 (10HShaikh) Just saw that on the patch it was mentioned that the shell access might be needed. To give more context I am not sure if the data...
[14:30:25] <wikibugs>	 10SRE, 10Observability-Logging, 10Observability-Metrics, 10serviceops, 10Performance-Team (Radar): Framework for running experiments on a subset of the app server fleet - https://phabricator.wikimedia.org/T315403 (10LSobanski)
[14:30:35] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] Switch profile::base::linux510 to the new meta package [puppet] - 10https://gerrit.wikimedia.org/r/841536 (https://phabricator.wikimedia.org/T319067) (owner: 10Muehlenhoff)
[14:30:56] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netops, 10User-jbond: Investigate improvements to how puppet manages network interfaces - https://phabricator.wikimedia.org/T234207 (10cmooney) >  Which means being able to map the real world interface to the logical one, from previous conversations it's o...
[14:31:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch profile::base::linux510 to the new meta package [puppet] - 10https://gerrit.wikimedia.org/r/841536 (https://phabricator.wikimedia.org/T319067) (owner: 10Muehlenhoff)
[14:33:22] <wikibugs>	 (03PS1) 10JMeybohm: Keep k8s tokens identifiable as dummys [labs/private] - 10https://gerrit.wikimedia.org/r/841537
[14:35:38] <wikibugs>	 (03PS1) 10Muehlenhoff: Use profile::base::use_linux510_on_buster for cloudmetrics [puppet] - 10https://gerrit.wikimedia.org/r/841538 (https://phabricator.wikimedia.org/T297814)
[14:39:34] <wikibugs>	 (03PS2) 10JMeybohm: Keep k8s tokens identifiable as dummys [labs/private] - 10https://gerrit.wikimedia.org/r/841537
[14:39:39] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] casLoginView.html: drop card properties [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/841181 (owner: 10Jbond)
[14:39:46] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] build.gradle: add oidc support [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/841456 (owner: 10Jbond)
[14:39:50] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] casLoginView.html: drop card properties [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/841181 (owner: 10Jbond)
[14:41:00] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Keep k8s tokens identifiable as dummys [labs/private] - 10https://gerrit.wikimedia.org/r/841537 (owner: 10JMeybohm)
[14:42:02] <wikibugs>	 (03PS3) 10JMeybohm: Keep k8s tokens identifiable as dummys [labs/private] - 10https://gerrit.wikimedia.org/r/841537
[14:42:57] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Keep k8s tokens identifiable as dummys [labs/private] - 10https://gerrit.wikimedia.org/r/841537 (owner: 10JMeybohm)
[14:46:31] <icinga-wm>	 PROBLEM - Check systemd state on ganeti4008 is CRITICAL: CRITICAL - degraded: The following units failed: nic-saturation-exporter.service,prometheus-ganeti-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:47:35] <wikibugs>	 (PS1) Filippo Giunchedi: pontoon: latest netbox hiera fixes [puppet] - https://gerrit.wikimedia.org/r/841541
[14:47:37] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: probe mgmt network from netmon host [puppet] - https://gerrit.wikimedia.org/r/841542 (https://phabricator.wikimedia.org/T169860)
[14:48:11] <icinga-wm>	 RECOVERY - Check systemd state on ganeti4008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:49:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet
[14:49:50] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] pontoon: latest netbox hiera fixes [puppet] - https://gerrit.wikimedia.org/r/841541 (owner: Filippo Giunchedi)
[14:50:43] <XioNoX>	 !log disable cr1-eqiad<->asw2-c-eqiad link for optic replacement
[14:50:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4008.ulsfo.wmnet to cluster ulsfo and group 1
[14:51:22] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4008.ulsfo.wmnet to cluster ulsfo and group 1
[14:52:42] <wikibugs>	 (PS4) Jgiannelos: mobileapps: Move hardcoded values to config [deployment-charts] - https://gerrit.wikimedia.org/r/841502 (https://phabricator.wikimedia.org/T320505)
[14:53:35] <wikibugs>	 (CR) Elukey: "For everybody's context, from the kube-api's help:" [puppet] - https://gerrit.wikimedia.org/r/841463 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[14:54:38] <dancy>	 jouncebot nowandnext
[14:54:38] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 5 minute(s)
[14:54:39] <jouncebot>	 In 1 hour(s) and 5 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221011T1600)
[14:56:40] <XioNoX>	 !log re-enable cr1-eqiad<->asw2-c-eqiad link after optic replacement
[14:56:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:07] <wikibugs>	 (CR) Elukey: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/841463 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[14:58:45] <wikibugs>	 (CR) Elukey: [C: +1] "LGTM (modulo pcc running and showing no failures)" [puppet] - https://gerrit.wikimedia.org/r/841495 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[14:58:56] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[15:00:26] <wikibugs>	 (PS3) Ssingh: dns4004: add Puppet role and DNS/NTP configs [puppet] - https://gerrit.wikimedia.org/r/841496 (https://phabricator.wikimedia.org/T317247)
[15:00:28] <wikibugs>	 (PS2) Filippo Giunchedi: prometheus: probe mgmt network from netmon host [puppet] - https://gerrit.wikimedia.org/r/841542 (https://phabricator.wikimedia.org/T169860)
[15:01:05] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:01:53] <wikibugs>	 (PS3) Elukey: istio: reduce Envoy logspam [docker-images/production-images] - https://gerrit.wikimedia.org/r/841527 (https://phabricator.wikimedia.org/T320468)
[15:02:13] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[15:02:48] <wikibugs>	 (CR) Samtar: [C: +1] "Cherry-picked to beta cluster, appears to be working well" [puppet] - https://gerrit.wikimedia.org/r/828664 (https://phabricator.wikimedia.org/T316845) (owner: Zabe)
[15:03:40] <wikibugs>	 (CR) Samtar: "Also cherry-picked to beta, would appreciate a second set of eyes on the config but I think it's reasonable finally 😊" [puppet] - https://gerrit.wikimedia.org/r/836953 (https://phabricator.wikimedia.org/T316845) (owner: Samtar)
[15:03:59] <wikibugs>	 (CR) Ssingh: [C: +2] dns4004: add Puppet role and DNS/NTP configs [puppet] - https://gerrit.wikimedia.org/r/841496 (https://phabricator.wikimedia.org/T317247) (owner: Ssingh)
[15:04:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4008.ulsfo.wmnet to cluster ulsfo and group 1
[15:04:21] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[15:05:03] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns4004.wikimedia.org with OS buster
[15:05:11] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns4004.wikimedia.org with OS buster
[15:05:14] <wikibugs>	 (CR) Samtar: [C: +1] "Cherry-picked to beta cluster, appears to be working well" [puppet] - https://gerrit.wikimedia.org/r/831955 (https://phabricator.wikimedia.org/T317417) (owner: MusikAnimal)
[15:07:17] <wikibugs>	 SRE-swift-storage, Beta-Cluster-Infrastructure, MediaWiki-extensions-Phonos, Community-Tech (CommTech-Sprint-35), and 2 others: Phonos links to an unauthorized URL - https://phabricator.wikimedia.org/T317417 (TheresNoTime) Acceptance criteria, issue not present — see also T314294
[15:07:41] <wikibugs>	 SRE, ops-eqiad, Data-Engineering: Check analytics1086 mgmt's cable - https://phabricator.wikimedia.org/T320458 (BTullis) Just for good measure, I have carried out a cold reset of the IPMI controller with: ` btullis@an-worker1086:~$ sudo bmc-device --cold-reset; echo $? 0 ` I'll check again to see whe...
[15:07:45] <wikibugs>	 SRE, ops-ulsfo, Traffic, decommission-hardware: decommission dns4002 - https://phabricator.wikimedia.org/T320440 (ssingh) @RobH: I think we can mark this as resolved as all the Puppet configuration has been removed and you already ran the decom cookbook. Deferring this to you in case something el...
[15:08:47] <wikibugs>	 ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T314998 (Jclark-ctr) Replaced Optic in asw2-c2 port 53. cleaned fiber both ends
[15:08:59] <wikibugs>	 ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T314998 (Jclark-ctr) a:Cmjohnson→Jclark-ctr
[15:09:10] <XioNoX>	 !log disable cr1-eqiad<->asw2-d-eqiad link for re-cabling - T313463
[15:09:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:14] <stashbot>	 T313463: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463
[15:10:28] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with reboot policy FORCED
[15:11:23] <wikibugs>	 SRE, ops-ulsfo, Traffic, decommission-hardware: decommission dns4002 - https://phabricator.wikimedia.org/T320440 (RobH) Open→Resolved
[15:11:27] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH)
[15:14:40] <wikibugs>	 (CR) MVernon: [C: +1] "This looks good to me, thanks for your patience with getting this all working!" [puppet] - https://gerrit.wikimedia.org/r/836953 (https://phabricator.wikimedia.org/T316845) (owner: Samtar)
[15:14:41] <wikibugs>	 SRE, SRE Observability (FY2022/2023-Q1), User-fgiunchedi: Export confd template status as Prometheus metrics - https://phabricator.wikimedia.org/T319272 (fgiunchedi) Open→Resolved
[15:14:44] <wikibugs>	 SRE, serviceops-radar, Patch-For-Review, SRE Observability (FY2022/2023-Q1), User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (fgiunchedi)
[15:16:01] <icinga-wm>	 PROBLEM - VRRP status on cr1-eqiad is CRITICAL: VRRP CRITICAL - 3 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[15:16:23] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with reboot policy FORCED
[15:16:28] <wikibugs>	 (CR) MVernon: [C: +1] "This looks correct to me, thanks." [puppet] - https://gerrit.wikimedia.org/r/828664 (https://phabricator.wikimedia.org/T316845) (owner: Zabe)
[15:17:17] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with reboot policy FORCED
[15:19:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:21:09] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] mobileapps: Move hardcoded values to config [deployment-charts] - https://gerrit.wikimedia.org/r/841502 (https://phabricator.wikimedia.org/T320505) (owner: Jgiannelos)
[15:21:21] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for hshaikh and ptiwary - https://phabricator.wikimedia.org/T319326 (Ottomata) > is available in superset It is available in superset, however because it is not a 'small' dataset, it is possible that it might...
[15:21:43] <wikibugs>	 (CR) Ladsgroup: [C: +2] Add drop_fr_comment_fr_text_T318955.py [software/schema-changes] - https://gerrit.wikimedia.org/r/841475 (https://phabricator.wikimedia.org/T318955) (owner: Ladsgroup)
[15:22:09] <wikibugs>	 (Merged) jenkins-bot: Add drop_fr_comment_fr_text_T318955.py [software/schema-changes] - https://gerrit.wikimedia.org/r/841475 (https://phabricator.wikimedia.org/T318955) (owner: Ladsgroup)
[15:23:06] <wikibugs>	 (PS1) Samtar: InitialiseSettings-labs: Enable Phonos on en_rtlwiki, enwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/841547 (https://phabricator.wikimedia.org/T314294)
[15:23:08] <wikibugs>	 (PS22) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040)
[15:23:32] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns4004.wikimedia.org with reason: host reimage
[15:24:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:25:21] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4008.ulsfo.wmnet to cluster ulsfo and group 1
[15:25:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2101.codfw.wmnet with reason: Maintenance
[15:26:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2101.codfw.wmnet with reason: Maintenance
[15:26:22] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns4004.wikimedia.org with reason: host reimage
[15:27:07] <wikibugs>	 (CR) Jgiannelos: [C: +2] mobileapps: Move hardcoded values to config [deployment-charts] - https://gerrit.wikimedia.org/r/841502 (https://phabricator.wikimedia.org/T320505) (owner: Jgiannelos)
[15:27:21] <TheresNoTime>	 Hi, I'm going to deploy a beta cluster only change ( https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/841547/ ) in a moment — any reasons not to? :)
[15:27:52] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (MoritzMuehlenhoff) I have set up ganeti4008 as a node in the ulsfo Ganeti cluster and moved a VM to it to confirm it works as expected.
[15:27:58] <dancy>	 None that I'm aware of
[15:29:11] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/841547 (https://phabricator.wikimedia.org/T314294) (owner: Samtar)
[15:29:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[15:30:00] <wikibugs>	 (Merged) jenkins-bot: InitialiseSettings-labs: Enable Phonos on en_rtlwiki, enwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/841547 (https://phabricator.wikimedia.org/T314294) (owner: Samtar)
[15:30:59] <TheresNoTime>	 !log deployed beta cluster only change, [[gerrit:841547]], for T314294
[15:31:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:04] <stashbot>	 T314294: Deploy Phonos to beta cluster - https://phabricator.wikimedia.org/T314294
[15:31:27] <wikibugs>	 (Merged) jenkins-bot: mobileapps: Move hardcoded values to config [deployment-charts] - https://gerrit.wikimedia.org/r/841502 (https://phabricator.wikimedia.org/T320505) (owner: Jgiannelos)
[15:32:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:32:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[15:33:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[15:33:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[15:34:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[15:35:33] <sukhe>	 !log sudo gnt-node migrate -f ganeti4001.ulsfo.wmnet
[15:35:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:38:32] <sukhe>	 !log sudo gnt-node evacuate -s ganeti4001.ulsfo.wmnet
[15:38:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:57] <wikibugs>	 (Abandoned) Dzahn: mediawiki::api: fix kernel parameter name ip_local_port_range [puppet] - https://gerrit.wikimedia.org/r/831629 (https://phabricator.wikimedia.org/T317454) (owner: Dzahn)
[15:41:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:42:45] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:44:10] <ottomata>	 !log remove materialized .json files from schemas/event/secondary - this should be a no-op as no clients should actually be using the json files. - T315674
[15:44:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:15] <stashbot>	 T315674: Remove materialized .json files from event schema repositories - https://phabricator.wikimedia.org/T315674
[15:44:42] <wikibugs>	 (PS8) Giuseppe Lavagetto: New organization of templates [deployment-charts] - https://gerrit.wikimedia.org/r/837495
[15:45:19] <wikibugs>	 (CR) CI reject: [V: -1] New organization of templates [deployment-charts] - https://gerrit.wikimedia.org/r/837495 (owner: Giuseppe Lavagetto)
[15:46:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:48:46] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Manuel - https://phabricator.wikimedia.org/T320504 (Lea_WMDE) As the team lead of @Manuel I approve!
[15:48:50] <wikibugs>	 (PS1) TrainBranchBot: testwikis wikis to 1.40.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/841548 (https://phabricator.wikimedia.org/T314194)
[15:48:52] <wikibugs>	 (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.40.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/841548 (https://phabricator.wikimedia.org/T314194) (owner: TrainBranchBot)
[15:49:15] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2111.codfw.wmnet with reason: Maintenance
[15:49:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2111.codfw.wmnet with reason: Maintenance
[15:49:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T318955)', diff saved to https://phabricator.wikimedia.org/P35401 and previous config saved to /var/cache/conftool/dbconfig/20221011-154934-ladsgroup.json
[15:49:38] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.40.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/841548 (https://phabricator.wikimedia.org/T314194) (owner: TrainBranchBot)
[15:49:39] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[15:49:42] <wikibugs>	 SRE, InternetArchiveBot, Traffic: IABot is encountering 429 on Wikimedia Production - https://phabricator.wikimedia.org/T318065 (Cyberpower678) Open→Resolved a:Cyberpower678
[15:49:54] <wikibugs>	 SRE, InternetArchiveBot: Request for increase request limit for InternetArchiveBot - https://phabricator.wikimedia.org/T318284 (Cyberpower678) Open→Resolved a:Cyberpower678
[15:49:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[15:50:02] <logmsgbot>	 !log dduvall@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.5  refs T314194
[15:50:03] <wikibugs>	 (CR) Elukey: [C: -1] "Looks like the 1.9.5-patch branch got deleted by upstream, they only offer 1.9.8-patch now.." [docker-images/production-images] - https://gerrit.wikimedia.org/r/841527 (https://phabricator.wikimedia.org/T320468) (owner: Elukey)
[15:50:06] <stashbot>	 T314194: 1.40.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T314194
[15:50:16] <wikibugs>	 (PS1) Filippo Giunchedi: sre: issue confd per-template alerts [alerts] - https://gerrit.wikimedia.org/r/841549 (https://phabricator.wikimedia.org/T314118)
[15:50:35] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host dns4004.wikimedia.org with OS buster
[15:50:41] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns4004.wikimedia.org with OS buster completed: - dns4004 (...
[15:50:45] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns4004.wikimedia.org with OS buster executed with errors:...
[15:50:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[15:50:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[15:51:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[15:53:22] <wikibugs>	 (CR) Dzahn: [C: +2] "I feel the outcome supported my point about merging things this way when we only find out later at reload if things worked." [puppet] - https://gerrit.wikimedia.org/r/839694 (owner: Hashar)
[15:55:56] <wikibugs>	 (CR) JMeybohm: kubernetes::master fail if user tokens are not unique (1 comment) [puppet] - https://gerrit.wikimedia.org/r/841495 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[15:55:58] <wikibugs>	 (PS10) Hashar: Send events to Wikimedia EventGate [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/814807
[15:56:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[15:56:56] <wikibugs>	 (PS8) Btullis: Add a new production images for spark and spark-operator [docker-images/production-images] - https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730)
[15:57:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[15:57:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[15:57:42] <wikibugs>	 (CR) Hashar: "Squashed in https://gerrit.wikimedia.org/r/c/operations/software/gerrit/plugins/events-wikimedia/+/814807/10 ;)" [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/816115 (owner: Hashar)
[15:58:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[15:59:52] <wikibugs>	 (CR) Btullis: Add a new production images for spark and spark-operator (6 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[16:00:05] <jouncebot>	 jbond and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221011T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:46] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with reboot policy FORCED
[16:00:59] <wikibugs>	 (Abandoned) Btullis: Add a spark-operator production image [docker-images/production-images] - https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[16:02:56] <wikibugs>	 (CR) Btullis: Add a spark-operator production image (4 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/838858 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[16:05:10] <wikibugs>	 (PS1) Eigyan: Undeploy the GDI wave 3 survey from PROD [mediawiki-config] - https://gerrit.wikimedia.org/r/841551 (https://phabricator.wikimedia.org/T320495)
[16:05:42] <wikibugs>	 ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T314998 (Jclark-ctr) Open→Resolved
[16:09:34] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[16:14:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T318955)', diff saved to https://phabricator.wikimedia.org/P35402 and previous config saved to /var/cache/conftool/dbconfig/20221011-161414-ladsgroup.json
[16:14:20] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[16:14:49] <wikibugs>	 (PS1) Matthias Mullie: Rescale images based on width alone [core] (wmf/1.40.0-wmf.5) - https://gerrit.wikimedia.org/r/841515 (https://phabricator.wikimedia.org/T320406)
[16:16:02] <ebernhardson>	 !log depool elastic2052. failing to join cluster due to `PROBLEM - MD RAID on elastic2052 is CRITICAL: CRITICAL: State: degraded, Active: 5, Working: 5, Failed: 1, Spare: 0`
[16:16:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:58] <wikibugs>	 (PS1) Ssingh: hiera: decom ganeti4001 [puppet] - https://gerrit.wikimedia.org/r/841553 (https://phabricator.wikimedia.org/T317249)
[16:21:06] <wikibugs>	 (PS1) Ahmon Dancy: P:gitlab::runner: Do not quote the value of environment variables [puppet] - https://gerrit.wikimedia.org/r/841554 (https://phabricator.wikimedia.org/T317997)
[16:21:45] <wikibugs>	 SRE, Data Engineering Planning, Data Pipelines, Shared-Data-Infrastructure: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (CDanis) Happy quarterly planning season; I was wondering if there was any updated estimates on when this m...
[16:23:19] <logmsgbot>	 !log volans@cumin2002 conftool action : set/pooled=no; selector: name=elastic2052..*
[16:23:57] <logmsgbot>	 !log dduvall@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.5  refs T314194 (duration: 33m 55s)
[16:24:02] <stashbot>	 T314194: 1.40.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T314194
[16:26:00] <logmsgbot>	 !log dduvall@deploy1002 Pruned MediaWiki: 1.40.0-wmf.3 (duration: 02m 00s)
[16:29:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P35403 and previous config saved to /var/cache/conftool/dbconfig/20221011-162920-ladsgroup.json
[16:32:13] <dancy>	 jbond/rzl:  I have a puppet patch 
[16:32:31] <rzl>	 dancy: hey, happy to deploy
[16:32:42] <dancy>	 Thanks!  It is https://gerrit.wikimedia.org/r/c/operations/puppet/+/841554
[16:33:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[16:33:25] <wikibugs>	 (CR) Btullis: Add a new production images for spark and spark-operator (2 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[16:34:41] <wikibugs>	 (CR) RLazarus: [C: +2] P:gitlab::runner: Do not quote the value of environment variables [puppet] - https://gerrit.wikimedia.org/r/841554 (https://phabricator.wikimedia.org/T317997) (owner: Ahmon Dancy)
[16:34:43] <wikibugs>	 (CR) Ahmon Dancy: "Pcc results: https://puppet-compiler.wmflabs.org/pcc-worker1002/37501/gitlab-runner1004.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/841554 (https://phabricator.wikimedia.org/T317997) (owner: Ahmon Dancy)
[16:34:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:35:20] <rzl>	 dancy: merging, will you want a manual puppet run anywhere?
[16:35:30] <dancy>	 Yes please.  On all gitlab-runner* hosts.
[16:35:46] <rzl>	 can do
[16:36:54] <rzl>	 done on 1002, running the others in parallel
[16:37:49] <rzl>	 and done
[16:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[16:39:14] <dancy>	 rzl: Thanks.  Looks like we need to configure puppet to restart the buildkitd service if that file changes.  In the meantime can you restart buildkitd on those same targets?
[16:39:23] <rzl>	 ah sure
[16:39:55] <rzl>	 just a "systemctl restart buildkitd", yeah?
[16:39:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:40:01] <dancy>	 yeah
[16:40:57] <rzl>	 !log gitlab-runner[1002-1004,2002-2004] - systemctl restart buildkitd - T317997
[16:41:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:02] <stashbot>	 T317997: Support http_proxy, https_proxy and other proxy `build-arg:` options in blubber buildkit frontend - https://phabricator.wikimedia.org/T317997
[16:41:14] <rzl>	 done
[16:41:23] <dancy>	 Thanks! Running a test build now
[16:42:20] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Openstack Keystone: Expose the Keystone public API [puppet] - https://gerrit.wikimedia.org/r/838903 (https://phabricator.wikimedia.org/T319312) (owner: Andrew Bogott)
[16:42:25] <wikibugs>	 (PS3) Andrew Bogott: Openstack Keystone: Expose the Keystone public API [puppet] - https://gerrit.wikimedia.org/r/838903 (https://phabricator.wikimedia.org/T319312)
[16:43:01] <icinga-wm>	 PROBLEM - IPMI Sensor Status on es2021 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[16:43:12] <dancy>	 rzl: Works! Thanks for getting us unstuck!
[16:43:15] <rzl>	 \o/
[16:44:27] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:44:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P35404 and previous config saved to /var/cache/conftool/dbconfig/20221011-164427-ladsgroup.json
[16:46:09] * topranks looking at above BGP status
[16:47:04] <wikibugs>	 SRE, ops-codfw, Discovery-Search: Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 (bking) a:RKemper
[16:47:51] <topranks>	 Above BGP alert was doh1001.  Has been back up for 2 mins, nothing to worry about.
[16:51:55] <sukhe>	 thanks topranks :) 
[16:52:06] <sukhe>	 one day™ we will find why it happens :P
[16:52:12] <topranks>	 I guess forcing the mode didn't fix it :(
[16:52:21] <sukhe>	 yeah... I was secretly hoping
[16:52:33] <sukhe>	 the other thing being that it only happens with doh1001
[16:52:35] <sukhe>	 and no other host
[16:52:36] <topranks>	 it will  be a story for the ages :)
[16:53:38] <sukhe>	 topranks: I am tempted to try rebooting the host, you know, just because
[16:53:42] <sukhe>	 I will do it I think
[16:53:54] <topranks>	 can't hurt at this stage
[16:53:59] <sukhe>	 $ uptime 16:53:50 up 194 days, 
[16:54:00] <sukhe>	 so yeah
[16:54:39] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on elastic2052.codfw.wmnet with reason: T320482
[16:54:43] <stashbot>	 T320482: Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482
[16:54:46] <sukhe>	 !log depool and reboot doh1001
[16:54:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:03] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on elastic2052.codfw.wmnet with reason: T320482
[16:55:39] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:55:44] <sukhe>	 ^ expected
[16:55:45] <wikibugs>	 10SRE, 10ops-codfw, 10Discovery-Search: Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 (10bking) Looks like there is [[ https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Sw_raid_rebuild_directions | a documented procedure for DC Ops to follow ]].  @Papaul I've downtimed the host...
[16:55:54] <wikibugs>	 10SRE, 10ops-codfw, 10Discovery-Search: Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 (10bking) a:05RKemper→03None
[16:57:16] <wikibugs>	 (03PS1) 10Dduvall: P:gitlab::runner: Use WMF fork of buildkit for buildkitd service [puppet] - 10https://gerrit.wikimedia.org/r/841556 (https://phabricator.wikimedia.org/T319694)
[16:58:02] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] P:gitlab::runner: Use WMF fork of buildkit for buildkitd service [puppet] - 10https://gerrit.wikimedia.org/r/841556 (https://phabricator.wikimedia.org/T319694) (owner: 10Dduvall)
[16:58:50] <wikibugs>	 (03PS2) 10Dduvall: P:gitlab::runner: Use WMF fork of buildkit for buildkitd service [puppet] - 10https://gerrit.wikimedia.org/r/841556 (https://phabricator.wikimedia.org/T319694)
[16:58:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[16:59:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T318955)', diff saved to https://phabricator.wikimedia.org/P35405 and previous config saved to /var/cache/conftool/dbconfig/20221011-165933-ladsgroup.json
[16:59:35] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2123.codfw.wmnet with reason: Maintenance
[16:59:38] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[16:59:41] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:59:47] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[16:59:49] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2123.codfw.wmnet with reason: Maintenance
[16:59:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T318955)', diff saved to https://phabricator.wikimedia.org/P35406 and previous config saved to /var/cache/conftool/dbconfig/20221011-165955-ladsgroup.json
[16:59:59] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] P:gitlab::runner: Use WMF fork of buildkit for buildkitd service [puppet] - 10https://gerrit.wikimedia.org/r/841556 (https://phabricator.wikimedia.org/T319694) (owner: 10Dduvall)
[17:00:09] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:00:18] <sukhe>	 ^ should be resolving soon
[17:01:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T318955)', diff saved to https://phabricator.wikimedia.org/P35407 and previous config saved to /var/cache/conftool/dbconfig/20221011-170121-ladsgroup.json
[17:01:55] <icinga-wm>	 RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 311, down: 3, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:02:01] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:02:23] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:06:57] <wikibugs>	 (03PS1) 10Ahmon Dancy: Restart buildkitd if its config files change [puppet] - 10https://gerrit.wikimedia.org/r/841557 (https://phabricator.wikimedia.org/T308271)
[17:07:30] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Restart buildkitd if its config files change [puppet] - 10https://gerrit.wikimedia.org/r/841557 (https://phabricator.wikimedia.org/T308271) (owner: 10Ahmon Dancy)
[17:08:44] <wikibugs>	 (03PS2) 10Ahmon Dancy: Restart buildkitd if its config files change [puppet] - 10https://gerrit.wikimedia.org/r/841557 (https://phabricator.wikimedia.org/T308271)
[17:09:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[17:13:30] <wikibugs>	 (03CR) 10Dduvall: [C: 03+1] Restart buildkitd if its config files change [puppet] - 10https://gerrit.wikimedia.org/r/841557 (https://phabricator.wikimedia.org/T308271) (owner: 10Ahmon Dancy)
[17:13:48] <wikibugs>	 (03CR) 10Ahmon Dancy: "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1003/37502/gitlab-runner1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/841557 (https://phabricator.wikimedia.org/T308271) (owner: 10Ahmon Dancy)
[17:16:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P35408 and previous config saved to /var/cache/conftool/dbconfig/20221011-171627-ladsgroup.json
[17:24:29] <wikibugs>	 (03CR) 10Btullis: Add a new production images for spark and spark-operator (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/838151 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis)
[17:25:09] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[17:25:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[17:25:35] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[17:26:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[17:26:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T318959)', diff saved to https://phabricator.wikimedia.org/P35409 and previous config saved to /var/cache/conftool/dbconfig/20221011-172608-ladsgroup.json
[17:26:13] <stashbot>	 T318959: Add fr_user index on flaggedrevs in production - https://phabricator.wikimedia.org/T318959
[17:28:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T318959)', diff saved to https://phabricator.wikimedia.org/P35410 and previous config saved to /var/cache/conftool/dbconfig/20221011-172822-ladsgroup.json
[17:31:02] <wikibugs>	 (03PS3) 10Dduvall: Add type Wmflib::POSIX::Name [puppet] - 10https://gerrit.wikimedia.org/r/840215 (owner: 10Ahmon Dancy)
[17:31:13] <wikibugs>	 (03PS4) 10Dduvall: Add type Wmflib::POSIX::Name [puppet] - 10https://gerrit.wikimedia.org/r/840215 (owner: 10Ahmon Dancy)
[17:31:20] <wikibugs>	 (03CR) 10Dduvall: Add type Wmflib::POSIX::Name (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/840215 (owner: 10Ahmon Dancy)
[17:31:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P35411 and previous config saved to /var/cache/conftool/dbconfig/20221011-173134-ladsgroup.json
[17:31:49] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: OpenConfirm - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:32:39] <jbond>	 dancy sorry for the delay (errand) but see rz.l sorted things for you now
[17:32:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[17:32:58] <dancy>	 Yep I'm all set.  
[17:33:08] <jbond>	 cool :)
[17:33:31] <dancy>	 jbond: I do have https://gerrit.wikimedia.org/r/c/operations/puppet/+/841557 which I made a few minutes ago
[17:33:46] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] sites.yaml: add dns4004 to anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/841533 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh)
[17:34:21] <dancy>	 jbond: And we need https://gerrit.wikimedia.org/r/c/operations/puppet/+/841556 deployed too
[17:34:38] * jbond looking
[17:35:45] <sukhe>	 !log running homer "cr*-ulsfo*" commit "Gerrit 841533: sites.yaml: add dns4004 to anycast_neighbors"
[17:35:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:04] <wikibugs>	 (03PS3) 10Jbond: Restart buildkitd if its config files change [puppet] - 10https://gerrit.wikimedia.org/r/841557 (https://phabricator.wikimedia.org/T308271) (owner: 10Ahmon Dancy)
[17:36:11] <wikibugs>	 (03CR) 10Jbond: Restart buildkitd if its config files change (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/841557 (https://phabricator.wikimedia.org/T308271) (owner: 10Ahmon Dancy)
[17:37:46] <sukhe>	 !log completed homer run for "cr*-ulsfo*" commit 841533
[17:37:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[17:37:55] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Restart buildkitd if its config files change [puppet] - 10https://gerrit.wikimedia.org/r/841557 (https://phabricator.wikimedia.org/T308271) (owner: 10Ahmon Dancy)
[17:38:10] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/841556 (https://phabricator.wikimedia.org/T319694) (owner: 10Dduvall)
[17:38:31] <icinga-wm>	 PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[17:40:12] <jbond>	 dancy: both merged and deployed
[17:40:31] <dancy>	 Thanks! I'm checking w/ Dan to see if anything needs a manual restart as a result.
[17:40:39] <jbond>	 ack
[17:41:41] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ssingh) dns4004 has been commissioned.
[17:41:48] <dancy>	 Looks like restarts are happening automatically.   Thanks for the help jbond and rzl.
[17:42:09] <rzl>	 👍
[17:42:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[17:43:02] <jbond>	 cool
[17:45:15] <icinga-wm>	 PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[17:46:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T318955)', diff saved to https://phabricator.wikimedia.org/P35412 and previous config saved to /var/cache/conftool/dbconfig/20221011-174641-ladsgroup.json
[17:46:47] <stashbot>	 T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955
[17:47:39] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "lgtm will merge thanks" [puppet] - 10https://gerrit.wikimedia.org/r/840215 (owner: 10Ahmon Dancy)
[17:48:10] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[17:57:10] <wikibugs>	 (03PS1) 10Jbond: systemd: Add explicit default for override_filename [puppet] - 10https://gerrit.wikimedia.org/r/841570
[17:58:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P35413 and previous config saved to /var/cache/conftool/dbconfig/20221011-175842-ladsgroup.json
[17:58:47] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37503/console" [puppet] - 10https://gerrit.wikimedia.org/r/841570 (owner: 10Jbond)
[17:59:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] systemd: Add explicit default for override_filename [puppet] - 10https://gerrit.wikimedia.org/r/841570 (owner: 10Jbond)
[18:00:04] <jouncebot>	 dduvall and ^demon: How many deployers does it take to do MediaWiki train - Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221011T1800).
[18:00:49] <sukhe>	 !log sudo gnt-node remove ganeti4001.ulsfo.wmnet
[18:00:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:01:33] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti4001.ulsfo.wmnet
[18:01:58] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 wikis to 1.40.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841571 (https://phabricator.wikimedia.org/T314194)
[18:02:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.40.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841571 (https://phabricator.wikimedia.org/T314194) (owner: 10TrainBranchBot)
[18:02:19] <wikibugs>	 (03PS2) 10Jbond: systemd: Add explicit default for override_filename [puppet] - 10https://gerrit.wikimedia.org/r/841570
[18:02:49] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.40.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841571 (https://phabricator.wikimedia.org/T314194) (owner: 10TrainBranchBot)
[18:03:10] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[18:03:11] <icinga-wm>	 RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9
[18:07:06] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[18:07:07] <logmsgbot>	 !log demon@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.5  refs T314194
[18:07:13] <stashbot>	 T314194: 1.40.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T314194
[18:08:26] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] systemd: Add explicit default for override_filename [puppet] - 10https://gerrit.wikimedia.org/r/841570 (owner: 10Jbond)
[18:09:27] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:09:28] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti4001.ulsfo.wmnet
[18:09:34] <jbond>	   ssh-keygen -f "/home/jbond/.ssh/known_hosts.d/wmf-prod" -R "sretest1002.eqiad.wmnet"
[18:09:35] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: ulsfo refresh scheduling - https://phabricator.wikimedia.org/T317249 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `ganeti4001.ulsfo.wmnet` - ganeti4001.ulsfo.wmnet (**PASS**)   - Downtimed host...
[18:10:17] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] hiera: decom ganeti4001 [puppet] - 10https://gerrit.wikimedia.org/r/841553 (https://phabricator.wikimedia.org/T317249) (owner: 10Ssingh)
[18:11:26] <XioNoX>	 !log re-enable cr1-eqiad<->asw2-d-eqiad link for re-cabling - T313463
[18:11:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:11:31] <stashbot>	 T313463: eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463
[18:11:49] <wikibugs>	 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: ulsfo refresh scheduling - https://phabricator.wikimedia.org/T317249 (10ssingh) @RobH: ganeti4001 has been decommissioned. Thanks!
[18:13:39] <icinga-wm>	 RECOVERY - VRRP status on cr1-eqiad is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[18:13:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T318959)', diff saved to https://phabricator.wikimedia.org/P35414 and previous config saved to /var/cache/conftool/dbconfig/20221011-181348-ladsgroup.json
[18:13:53] <stashbot>	 T318959: Add fr_user index on flaggedrevs in production - https://phabricator.wikimedia.org/T318959
[18:14:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1013:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[18:16:40] <dcausse>	 !log restarting blazegraph on wdqs1013 (BlazegraphFreeAllocatorsDecreasingRapidly)
[18:16:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:16:51] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[18:19:01] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[18:19:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1013:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[18:29:02] <jbond>	 https://github.com/wikimedia/puppet/blob/production/modules/systemd/manifests/unit.pp#L69-L73
[18:29:40] <wikibugs>	 (03PS1) 10Ori: service::docker: allow runtime to be specified [puppet] - 10https://gerrit.wikimedia.org/r/841574 (https://phabricator.wikimedia.org/T316706)
[18:29:42] <wikibugs>	 (03PS1) 10Ori: add profile::docker::gvisor [puppet] - 10https://gerrit.wikimedia.org/r/841575 (https://phabricator.wikimedia.org/T316706)
[18:32:05] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] add profile::docker::gvisor [puppet] - 10https://gerrit.wikimedia.org/r/841575 (https://phabricator.wikimedia.org/T316706) (owner: 10Ori)
[18:37:07] <wikibugs>	 (03PS2) 10Ori: add profile::docker::gvisor [puppet] - 10https://gerrit.wikimedia.org/r/841575 (https://phabricator.wikimedia.org/T316706)
[18:41:34] <wikibugs>	 (03PS1) 10Jbond: systemd: improve abbility to have addtional overrides [puppet] - 10https://gerrit.wikimedia.org/r/841577
[18:53:02] <wikibugs>	 (03PS1) 10Dduvall: P:gitlab::runner: Enforce Wmflib::POSIX::Variables type for environment [puppet] - 10https://gerrit.wikimedia.org/r/841578
[18:58:05] <wikibugs>	 (03PS2) 10Jbond: systemd: improve abbility to have addtional overrides [puppet] - 10https://gerrit.wikimedia.org/r/841577
[19:00:18] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] systemd: improve abbility to have addtional overrides [puppet] - 10https://gerrit.wikimedia.org/r/841577 (owner: 10Jbond)
[19:11:04] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (10cmooney)
[19:11:26] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): eqiad: upgrade row C and D uplinks from 4x10G to 1x40G - https://phabricator.wikimedia.org/T313463 (10cmooney)
[19:11:32] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (10cmooney) 05Open→03In progress p:05Triage→03High
[19:11:48] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (10cmooney)
[19:14:24] <wikibugs>	 (03CR) 10EllenR: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841551 (https://phabricator.wikimedia.org/T320495) (owner: 10Eigyan)
[19:15:32] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (10cmooney)
[19:16:32] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (10ayounsi) a:05Jclark-ctr→03None
[19:20:47] <wikibugs>	 (03PS1) 10Andrew Bogott: keystone: remove password safelist check from wmtotp auth module [puppet] - 10https://gerrit.wikimedia.org/r/841581 (https://phabricator.wikimedia.org/T320541)
[19:21:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] keystone: remove password safelist check from wmtotp auth module [puppet] - 10https://gerrit.wikimedia.org/r/841581 (https://phabricator.wikimedia.org/T320541) (owner: 10Andrew Bogott)
[19:23:23] <wikibugs>	 (03PS2) 10Andrew Bogott: keystone: remove password safelist check from wmtotp auth module [puppet] - 10https://gerrit.wikimedia.org/r/841581 (https://phabricator.wikimedia.org/T320541)
[19:29:10] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Cr1-eqiad comms problem when moving to 40G row D handoff - https://phabricator.wikimedia.org/T320566 (10cmooney)
[19:32:33] <wikibugs>	 (03PS3) 10Jbond: systemd: improve abbility to have addtional overrides [puppet] - 10https://gerrit.wikimedia.org/r/841577
[19:35:31] <wikibugs>	 (03CR) 10Andrew Bogott: "this is useful if a user wants to script generation of application credentials or similar. i'm not sure it's strictly necessary, we could " [puppet] - 10https://gerrit.wikimedia.org/r/841581 (https://phabricator.wikimedia.org/T320541) (owner: 10Andrew Bogott)
[19:37:11] <wikibugs>	 (03PS1) 10Ryan Kemper: [wip] query_service: try installing nginx w extras [puppet] - 10https://gerrit.wikimedia.org/r/841582 (https://phabricator.wikimedia.org/T313751)
[19:38:06] <wikibugs>	 (03PS2) 10Ryan Kemper: [wip] query_service: try installing nginx w extras [puppet] - 10https://gerrit.wikimedia.org/r/841582 (https://phabricator.wikimedia.org/T313751)
[19:39:23] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with reboot policy FORCED
[19:40:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [wip] query_service: try installing nginx w extras [puppet] - 10https://gerrit.wikimedia.org/r/841582 (https://phabricator.wikimedia.org/T313751) (owner: 10Ryan Kemper)
[19:41:15] <wikibugs>	 (03Abandoned) 10Andrew Bogott: keystone: remove password safelist check from wmtotp auth module [puppet] - 10https://gerrit.wikimedia.org/r/841581 (https://phabricator.wikimedia.org/T320541) (owner: 10Andrew Bogott)
[19:43:20] <wikibugs>	 (03PS3) 10Ryan Kemper: [wip] query_service: try installing nginx w extras [puppet] - 10https://gerrit.wikimedia.org/r/841582 (https://phabricator.wikimedia.org/T313751)
[19:44:17] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37505/console" [puppet] - 10https://gerrit.wikimedia.org/r/841582 (https://phabricator.wikimedia.org/T313751) (owner: 10Ryan Kemper)
[19:45:20] <wikibugs>	 (03PS4) 10Ryan Kemper: [wip] query_service: try installing nginx w extras [puppet] - 10https://gerrit.wikimedia.org/r/841582 (https://phabricator.wikimedia.org/T313751)
[19:46:13] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37506/console" [puppet] - 10https://gerrit.wikimedia.org/r/841582 (https://phabricator.wikimedia.org/T313751) (owner: 10Ryan Kemper)
[19:46:51] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2001.mgmt.codfw.wmnet with reboot policy FORCED
[19:47:19] <wikibugs>	 (03Abandoned) 10Dduvall: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/835667 (owner: 10PipelineBot)
[19:49:30] <wikibugs>	 (03PS5) 10Ryan Kemper: wdqs-test: try installing nginx w extras [puppet] - 10https://gerrit.wikimedia.org/r/841582 (https://phabricator.wikimedia.org/T313751)
[19:49:43] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] wdqs-test: try installing nginx w extras [puppet] - 10https://gerrit.wikimedia.org/r/841582 (https://phabricator.wikimedia.org/T313751) (owner: 10Ryan Kemper)
[19:51:00] <wikibugs>	 10SRE: rsyslog::conf puppet define types inserts an extraneous newline in the content param - https://phabricator.wikimedia.org/T320569 (10jhathaway)
[19:51:09] <wikibugs>	 10SRE: rsyslog::conf puppet define types inserts an extraneous newline in the content param - https://phabricator.wikimedia.org/T320569 (10jhathaway) a:03jhathaway
[19:54:38] <wikibugs>	 (03PS1) 10Ryan Kemper: Revert "wdqs-test: try installing nginx w extras" [puppet] - 10https://gerrit.wikimedia.org/r/841518
[19:55:32] <wikibugs>	 (03PS1) 10JHathaway: rsyslog::conf remove trailing newline logic [puppet] - 10https://gerrit.wikimedia.org/r/841583 (https://phabricator.wikimedia.org/T320569)
[19:56:41] <wikibugs>	 (03CR) 10Dduvall: P:ci::docker: Install upstream docker packages for all CI agents (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/834399 (https://phabricator.wikimedia.org/T318382) (owner: 10Dduvall)
[19:57:07] <wikibugs>	 (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/841583 (https://phabricator.wikimedia.org/T320569) (owner: 10JHathaway)
[19:57:13] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/841583 (https://phabricator.wikimedia.org/T320569) (owner: 10JHathaway)
[19:57:53] <wikibugs>	 (03CR) 10Dduvall: P:ci::docker: Upgrade docker to 20.10.18 on all CI agents (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/834400 (https://phabricator.wikimedia.org/T318382) (owner: 10Dduvall)
[19:58:10] <wikibugs>	 (03PS2) 10Ryan Kemper: Revert "wdqs-test: try installing nginx w extras" [puppet] - 10https://gerrit.wikimedia.org/r/841518 (https://phabricator.wikimedia.org/T313751)
[19:58:23] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] "sd" [puppet] - 10https://gerrit.wikimedia.org/r/841518 (https://phabricator.wikimedia.org/T313751) (owner: 10Ryan Kemper)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, and TheresNoTime: Dear deployers, time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221011T2000).
[20:00:05] <jouncebot>	 eigyan: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:27] <TheresNoTime>	 I can deploy! :D
[20:00:33] <urbanecm>	 Go ahead!
[20:00:53] * TheresNoTime waits on eigyan :)
[20:01:24] <wikibugs>	 (03PS2) 10Samtar: Undeploy the GDI wave 3 survey from PROD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841551 (https://phabricator.wikimedia.org/T320495) (owner: 10Eigyan)
[20:02:28] <eigyan>	 greetings all
[20:02:45] <eigyan>	 o/
[20:03:08] <TheresNoTime>	 eigyan: hi! :)
[20:03:28] <eigyan>	 hey there TheresNoTime :)
[20:03:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841551 (https://phabricator.wikimedia.org/T320495) (owner: 10Eigyan)
[20:04:39] <wikibugs>	 (03Merged) 10jenkins-bot: Undeploy the GDI wave 3 survey from PROD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/841551 (https://phabricator.wikimedia.org/T320495) (owner: 10Eigyan)
[20:05:07] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:841551|Undeploy the GDI wave 3 survey from PROD (T320495)]]
[20:05:11] <stashbot>	 T320495: Undeploy GDI Safety Survey Wave 3 from EN, ES, FR, and PT wikis - https://phabricator.wikimedia.org/T320495
[20:05:31] <logmsgbot>	 !log samtar@deploy1002 samtar and essexigyan: Backport for [[gerrit:841551|Undeploy the GDI wave 3 survey from PROD (T320495)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[20:05:57] <TheresNoTime>	 eigyan: that's live on mwdebug1001, can you test? :)
[20:06:19] <eigyan>	 will do! thank you!
[20:06:50] <wikibugs>	 10SRE, 10ops-codfw, 10Discovery-Search: Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 (10Papaul) @bking this  host is out of warranty. If it is a critical host you will have to let us know and request to purchase a disk.  Another option is to check also if we have any disk similar...
[20:07:19] <eigyan>	 All is well TheresNoTime
[20:07:27] <TheresNoTime>	 great, syncing 
[20:09:34] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for service wikifunctions.beta.wmflabs.org:443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#wikifunctions.beta.wmflabs.org:443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[20:10:11] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:11:37] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:841551|Undeploy the GDI wave 3 survey from PROD (T320495)]] (duration: 06m 29s)
[20:11:41] <stashbot>	 T320495: Undeploy GDI Safety Survey Wave 3 from EN, ES, FR, and PT wikis - https://phabricator.wikimedia.org/T320495
[20:11:53] <TheresNoTime>	 that's live in production now eigyan, mind checking one last time?
[20:12:05] <eigyan>	 will do!
[20:12:09] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (Andrew)  >  > Also you might like to update this section when convenient: https://wikitech.wikimedia.org/wiki/Dumps/Dump_ser...
[20:13:49] <eigyan>	 We are looking good @there
[20:14:04] <eigyan>	 We are looking good TheresNoTime
[20:14:10] <TheresNoTime>	 eigyan: great, all done then :)
[20:14:30] <eigyan>	 Excellent, as always thanks for all your help:)
[20:15:00] <TheresNoTime>	 you're very welcome :)
[20:16:15] <wikibugs>	 (PS2) JHathaway: rsyslog::conf remove trailing newline logic [puppet] - https://gerrit.wikimedia.org/r/841583 (https://phabricator.wikimedia.org/T320569)
[20:18:23] <TheresNoTime>	 I'll be around for a little while longer if there's any last-minute patches for deployment
[20:23:01] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[20:25:11] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[20:25:26] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=eqiad,name=phab1001-vcs.eqiad.wmnet
[20:25:55] <mutante>	 !log depooling git-ssh service backends - checking if monitoring will alert
[20:25:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:26:39] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=phab2001-vcs.codfw.wmnet
[20:26:52] <TheresNoTime>	 !log close UTC late backport window
[20:26:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:23] <mutante>	 !log depooling git-ssh service backends with confctl - T296022
[20:27:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:27] <stashbot>	 T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022
[20:30:25] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:32:13] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2008 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.153.250:22, 2620:0:860:ed1a::3:fa:22]) https://wikitech.wikimedia.org/wiki/PyBal
[20:32:21] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.153.250:22, 2620:0:860:ed1a::3:fa:22]) https://wikitech.wikimedia.org/wiki/PyBal
[20:32:32] <mutante>	 ^ yea, that's what I wanted to test once again. the docs claim these are "temporary"
[20:33:02] <mutante>	 and that they would happen when adding new services. but that's not the case here. it's about properly depooling if you only have 1 backend
[20:33:16] <jinxer-wm>	 (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[20:33:25] <mutante>	 I am not sure it's possible to do it right
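The PyBal IPVS diff alerts above report services that PyBal has configured but that are missing from the IPVS table. A minimal sketch of that comparison as a set difference follows. It is not the actual Icinga check: the function name `pybal_ipvs_diff` and the sample inputs are hypothetical, with the two port-22 addresses copied from the lvs2008 alert text and the port-443 entry made up for illustration.

```python
def pybal_ipvs_diff(pybal_services, ipvs_services):
    """Return the services that appear on only one side of the comparison."""
    pybal = set(pybal_services)
    ipvs = set(ipvs_services)
    return {
        "pybal_only": sorted(pybal - ipvs),  # configured, but not in IPVS
        "ipvs_only": sorted(ipvs - pybal),   # in IPVS, but no longer configured
    }


# The two port-22 entries come from the alert; 192.0.2.10:443 is a made-up example.
configured = ["208.80.153.250:22", "2620:0:860:ed1a::3:fa:22", "192.0.2.10:443"]
in_kernel = ["192.0.2.10:443"]

diff = pybal_ipvs_diff(configured, in_kernel)
print(diff["pybal_only"])
```

An empty `pybal_only` list would correspond to the "no difference between hosts in IPVS/PyBal" recovery message.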
[20:34:43] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:861:ed1a::3:16:22, 208.80.154.250:22]) https://wikitech.wikimedia.org/wiki/PyBal
[20:35:09] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1018 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:861:ed1a::3:16:22, 208.80.154.250:22]) https://wikitech.wikimedia.org/wiki/PyBal
[20:35:18] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=inactive; selector: dc=codfw,name=phab2001-vcs.codfw.wmnet
[20:35:24] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=eqiad,name=phab1001-vcs.eqiad.wmnet
[20:38:16] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[20:39:40] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1018 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:861:ed1a::3:16:22, 208.80.154.250:22]) daniel_zahn trying to decom this service https://wikitech.wikimedia.org/wiki/PyBal
[20:39:40] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2620:0:861:ed1a::3:16:22, 208.80.154.250:22]) daniel_zahn trying to decom this service https://wikitech.wikimedia.org/wiki/PyBal
[20:39:40] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2008 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.153.250:22, 2620:0:860:ed1a::3:fa:22]) daniel_zahn trying to decom this service https://wikitech.wikimedia.org/wiki/PyBal
[20:39:40] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.153.250:22, 2620:0:860:ed1a::3:fa:22]) daniel_zahn trying to decom this service https://wikitech.wikimedia.org/wiki/PyBal
[20:39:41] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/841583 (https://phabricator.wikimedia.org/T320569) (owner: JHathaway)
[20:40:21] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/git-ssh is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[20:40:57] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[20:41:20] <icinga-wm>	 ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/git-ssh is broken daniel_zahn ? https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[20:41:20] <icinga-wm>	 ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/git-ssh is broken daniel_zahn ? https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[20:41:20] <icinga-wm>	 ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/git-ssh is broken daniel_zahn ? https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[20:41:20] <icinga-wm>	 ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/git-ssh is broken daniel_zahn ? https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[20:41:25] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1018 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[20:52:14] <bblack>	 mutante: hey
[20:52:28] <bblack>	 need help?
[20:54:09] <mutante>	 bblack: I am looking for a way to disable/remove an existing LVS service, but in a way that is still easy to revert and does not cause these alerts 
[20:54:09] <bblack>	 basically, after all the config is deployed, you have to manually remove the final entry from IPVS itself from the CLI
[20:54:31] <mutante>	 it seems I can only do it the right way following https://wikitech.wikimedia.org/wiki/LVS#Remove_a_load_balanced_service
[20:54:36] <bblack>	 if you end up later reverting the patches, the redeployment will re-provision fine.  but final decom is manual-only.
[20:54:37] <mutante>	 which starts with silencing those alerts
[20:54:44] <mutante>	 there are about 8 alerts though 
[20:54:49] <mutante>	 not just the networking ones
[20:55:06] <mutante>	 4 of them are "Compilation of file '/srv/config-master/pybal/codfw/git-ssh' is broken"
[20:55:33] <mutante>	 even though I have not done anything besides depool. but the special case is there is just one backend
[20:55:55] <wikibugs>	 (PS1) Dduvall: P:gitlab::runner: Fix buildkitd image ref on WMCS [puppet] - https://gerrit.wikimedia.org/r/841584 (https://phabricator.wikimedia.org/T319694)
[20:56:23] <bblack>	 The "remove a loadbalanced service" thing also seems to kind of assume a "discovery" service in places
[20:56:28] <logmsgbot>	 !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=codfw,name=phab2001-vcs.codfw.wmnet
[20:56:41] <bblack>	 yeah I don't think a service can legitimately exist without a backend
[20:57:06] <mutante>	 so.. first I just wanted to remove it from DNS. thinking that is still easy to revert if you have to
[20:57:09] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2008 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[20:57:15] <mutante>	 but of course pybal will not like that either
[20:57:17] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[20:57:23] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:57:35] <mutante>	 then I wanted to test again what alerts I really get
[20:57:47] <mutante>	 when I depool the one backend
[20:57:48] <bblack>	 are we removing this for good?
[20:57:58] <mutante>	 I hope so. yes.
[20:57:59] <bblack>	 git-ssh I mean
[20:58:03] <mutante>	 that's the goal
[20:58:10] <bblack>	 I didn't realize
[20:58:14] <mutante>	 I was just hoping I could just disable it for a week
[20:58:21] <mutante>	 before there are more patches
[20:59:05] <bblack>	 was there some planned phaseout I missed?
[20:59:37] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:59:43] <bblack>	 oh this is maybe not what I was thinking it was
[20:59:49] <bblack>	 I get it now, this is *just* for phab-vcs
[20:59:51] <mutante>	 it's https://phabricator.wikimedia.org/T296022
[21:00:02] <mutante>	 yea, we want to keep gerrit and gitlab
[21:00:05] <bblack>	 for some reason I started thinking this was our gerrit ssh port somehow, indirectly :)
[21:00:08] <mutante>	 but disable repos on phab
[21:00:34] <mutante>	 no, it's just trying to reduce the number of places we have for git repos
[21:00:40] <mutante>	 and at the same time simplify the phab server setup
[21:00:43] <bblack>	 if you just want to disable it, you could leave all this lvs/dns stuff alone and just change the ferm rules on the phab hosts to not allow port 22 from anywhere?
[21:00:53] <bblack>	 although that probably still causes a monitoring alert somewhere to silence
[21:01:40] <mutante>	 hmm. ACK. right now I was bothered by the additional "Compilation of file '/srv/config-master/pybal/eqiad/git-ssh' is broken" type of alerts
[21:01:45] <bblack>	 yeah
[21:01:46] <mutante>	 but looks like 2 of them did go away 
[21:01:51] <bblack>	 I think that's because the service has no backend
[21:01:56] <mutante>	 I kind of remember them from the past too
[21:02:15] <mutante>	 it just takes quite some time until that check realizes changes
[21:02:34] <mutante>	 right now I have pooled the backends in both DCs again
[21:02:38] <bblack>	 ok
[21:03:01] <mutante>	 there were 4 alerts (on puppetmasters) about the templates. now there are 2
[21:03:06] <mutante>	 but still one per DC
[21:05:34] <bblack>	 which alerts?
[21:06:09] <mutante>	 https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=puppetmaster1001&service=Confd+template+for+%2Fsrv%2Fconfig-master%2Fpybal%2Feqiad%2Fgit-ssh
[21:06:16] <mutante>	 https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=puppetmaster2001&service=Confd+template+for+%2Fsrv%2Fconfig-master%2Fpybal%2Feqiad%2Fgit-ssh
[21:06:31] <mutante>	 maybe in a minute 
[21:06:42] <bblack>	 that might be one of those ones that persists due to some error-state file
[21:06:45] <bblack>	 hmmm
[21:07:09] <mutante>	 I think I had to delete the error files before
[21:07:16] <mutante>	 and it happened every time I tried this :)
[21:07:42] <mutante>	 then I did it again :p
[21:08:02] <mutante>	 let me try to find the err file
[21:08:37] <wikibugs>	 SRE-swift-storage, Beta-Cluster-Infrastructure, MediaWiki-extensions-Phonos, Community-Tech (CommTech-Sprint-35), and 2 others: Phonos links to an unauthorized URL - https://phabricator.wikimedia.org/T317417 (GMikesell-WMF) @TheresNoTime View a page on the Beta Cluster with a Phonos parser functi...
[21:09:08] <bblack>	 yeah /var/run/confd-template
[21:09:41] <bblack>	 basically: rm -f /var/run/confd-template/.git-ssh*
[21:10:32] <mutante>	 ACK, thanks. those are the ones.
[21:10:54] <mutante>	 !log puppetmaster2001: rm .*.err  in /var/run/confd-template
[21:10:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:11:00] <mutante>	 just the .err files but same thing
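The cleanup discussed above (removing stale error-state files so the Confd template check can recover) can be sketched with a dry-run option. This is only an illustration under stated assumptions: the helper name `clear_error_files` is hypothetical, the `.<service>*.err` pattern is a guess that combines the two commands quoted in the chat, and the file names in the demo are invented. The demo runs against a temporary directory, not /var/run/confd-template.

```python
import glob
import os
import tempfile


def clear_error_files(directory, service, dry_run=True):
    """Find hidden error-state files for one service and optionally delete them."""
    # The pattern starts with a literal dot, so glob will match hidden files.
    pattern = os.path.join(
        glob.escape(directory), "." + glob.escape(service) + "*.err"
    )
    matches = sorted(glob.glob(pattern))
    if not dry_run:
        for path in matches:
            os.remove(path)
    return matches


# Demo in a throwaway directory with invented file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in (".git-ssh1234.err", ".other5678.err"):
        open(os.path.join(tmp, name), "w").close()
    removed = clear_error_files(tmp, "git-ssh", dry_run=False)
    remaining = sorted(os.listdir(tmp))

print([os.path.basename(p) for p in removed], remaining)
```

Calling the helper with the default `dry_run=True` first lists what would be deleted, which is a cheap safeguard before removing anything on a puppetmaster.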
[21:11:27] <icinga-wm>	 RECOVERY - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[21:12:01] <mutante>	 yea, so, if I really want to disable it for a week, downtiming the checks for _any_ LVS service seems a bit bad
[21:12:52] <mutante>	 !log puppetmaster1001: rm .*.err  in /var/run/confd-template
[21:12:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:13:12] <mutante>	 I will look at your suggestion to close port 22
[21:13:23] <mutante>	 and maybe test what alerts then?
[21:13:55] <icinga-wm>	 PROBLEM - SSH on wdqs2005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:14:42] <bblack>	 probably a lot of functional checks will fail, including pybal healthchecks
[21:14:56] <bblack>	 but they should be ackable individually?
[21:15:06] <bblack>	 basically any checks that actually hit that port
[21:16:15] <mutante>	 ok, ack. if all those alerts are specific to my service then that's better
[21:17:08] <mutante>	 or.. I need to remove them from pybal config?
[21:17:14] <mutante>	 and then depool
[21:18:09] <mutante>	 I also wasn't sure if it's a bad idea to remove it from conftool-data if I only touch one of both data centers
[21:18:48] <mutante>	 it should also be ok if I just remove it from conftool-data for both DCs and revert if needed. hopefully it won't be needed 
[21:21:20] <mutante>	 I will stop the sshd service on a backend to see the alert
[21:21:38] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] P:gitlab::runner: Fix buildkitd image ref on WMCS [puppet] - https://gerrit.wikimedia.org/r/841584 (https://phabricator.wikimedia.org/T319694) (owner: Dduvall)
[21:22:22] <mutante>	 !log phab2001 - systemctl stop ssh-phab; temp disable puppet
[21:22:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:41] <wikibugs>	 (CR) Dzahn: [C: +2] P:gitlab::runner: Fix buildkitd image ref on WMCS [puppet] - https://gerrit.wikimedia.org/r/841584 (https://phabricator.wikimedia.org/T319694) (owner: Dduvall)
[21:23:18] <dancy>	 Thx mutante!
[21:23:40] <dduvall>	 yes thank you :)
[21:23:53] <mutante>	 np, cloud-only, heh
[21:25:55] <dancy>	 buildkitd is running on runner-1024.gitlab-runners.eqiad1.wikimedia.cloud now (after I ran run-puppet-agent)
[21:27:09] <icinga-wm>	 PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:27:11] <mutante>	 nice!
[21:27:21] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2008 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[21:27:53] <mutante>	 bblack: ^ this is the one when I just stop the backend or would firewall it. but yea, I can downtime those.. right
[21:28:25] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[21:28:31] <mutante>	 waits a couple more minutes for more.. and there it goes
[21:28:50] <mutante>	 this will be on every lvs server.. but just 2 per DC I guess
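The "marked down but pooled" condition in the alerts above combines two independent facts about a backend: whether it is pooled, and whether its health check passes. A small sketch of that classification follows. It is not PyBal's own logic: the `Backend` type, its fields, and the second host name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    pooled: bool   # desired state, e.g. as set through conftool
    healthy: bool  # observed state from a health check


def down_but_pooled(backends):
    """Return names of backends that are still pooled but fail health checks."""
    return [b.name for b in backends if b.pooled and not b.healthy]


backends = [
    Backend("phab2001-vcs.codfw.wmnet", pooled=True, healthy=False),
    Backend("example-backend.invalid", pooled=False, healthy=False),  # made up
]

flagged = down_but_pooled(backends)
print(flagged)
```

Stopping the service without depooling it lands the backend in the flagged list, which matches the alert that was then acknowledged rather than silenced for the whole host.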
[21:29:38] <wikibugs>	 (Abandoned) Dduvall: pipeline: Make blubberfile definitions slightly more coherent [mediawiki-config] - https://gerrit.wikimedia.org/r/708582 (owner: Dduvall)
[21:30:00] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal backends health check on lvs2008 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled daniel_zahn soft decom https://wikitech.wikimedia.org/wiki/PyBal
[21:30:00] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled daniel_zahn soft decom https://wikitech.wikimedia.org/wiki/PyBal
[21:30:03] <bblack>	 yup!
[21:30:26] <mutante>	 not using cookbook but good old Icinga web UI to downtime just those and not other stuff on the hosts
[21:30:36] <mutante>	 ack,ty
[21:31:01] <icinga-wm>	 PROBLEM - SSH on mw1319.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:36:53] <mutante>	 !log phab1001 / phab2001 - temp. disabled puppet; stopped ssh-phab service; scheduled icinga downtimes for ssh-phab pybal backend alerts - effectively "soft shutting down" the service - T296022
[21:36:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:36:58] <stashbot>	 T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022
[21:41:15] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled: git-ssh6_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled daniel_zahn soft decom https://wikitech.wikimedia.org/wiki/PyBal
[21:41:15] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled: git-ssh6_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled daniel_zahn soft decom https://wikitech.wikimedia.org/wiki/PyBal
[21:41:52] <wikibugs>	 (PS1) Dzahn: phabricator: stop ssh-phab service [puppet] - https://gerrit.wikimedia.org/r/841587 (https://phabricator.wikimedia.org/T296022)
[21:42:32] <wikibugs>	 (CR) Dzahn: [C: +2] phabricator: stop ssh-phab service [puppet] - https://gerrit.wikimedia.org/r/841587 (https://phabricator.wikimedia.org/T296022) (owner: Dzahn)
[21:44:49] <wikibugs>	 (CR) Dzahn: [C: +2] "puppet re-enabled" [puppet] - https://gerrit.wikimedia.org/r/841587 (https://phabricator.wikimedia.org/T296022) (owner: Dzahn)
[21:47:48] <wikibugs>	 (CR) Ori: "Thanks for this." [puppet] - https://gerrit.wikimedia.org/r/841577 (owner: Jbond)
[21:58:05] <wikibugs>	 (PS4) Jbond: systemd::override: Add new helper define for overrides [puppet] - https://gerrit.wikimedia.org/r/841577
[21:59:14] <wikibugs>	 (PS5) Jbond: systemd::override: Add new helper define for overrides [puppet] - https://gerrit.wikimedia.org/r/841577
[22:00:22] <wikibugs>	 (CR) Jbond: systemd::override: Add new helper define for overrides (2 comments) [puppet] - https://gerrit.wikimedia.org/r/841577 (owner: Jbond)
[22:02:20] <wikibugs>	 (CR) CI reject: [V: -1] systemd::override: Add new helper define for overrides [puppet] - https://gerrit.wikimedia.org/r/841577 (owner: Jbond)
[22:03:16] <wikibugs>	 (PS6) Jbond: systemd::override: Add new helper define for overrides [puppet] - https://gerrit.wikimedia.org/r/841577
[22:04:44] <wikibugs>	 (PS7) Jbond: systemd::override: Add new helper define for overrides [puppet] - https://gerrit.wikimedia.org/r/841577
[22:06:22] <wikibugs>	 (PS8) Jbond: systemd::override: Add new helper define for overrides [puppet] - https://gerrit.wikimedia.org/r/841577
[22:10:32] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37507/console" [puppet] - https://gerrit.wikimedia.org/r/841577 (owner: Jbond)
[22:10:43] <wikibugs>	 (CR) Jbond: "Its late but i think this should be ready to review, i realised so probably wont get to merge until Thursday but it should be a noop for c" [puppet] - https://gerrit.wikimedia.org/r/841577 (owner: Jbond)
[22:11:52] <wikibugs>	 (CR) Jbond: systemd::override: Add new helper define for overrides (1 comment) [puppet] - https://gerrit.wikimedia.org/r/841577 (owner: Jbond)
[22:13:34] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37508/console" [puppet] - https://gerrit.wikimedia.org/r/841577 (owner: Jbond)
[22:15:01] <icinga-wm>	 RECOVERY - SSH on wdqs2005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:21:17] <wikibugs>	 (PS9) Jbond: systemd::override: Add new helper define for overrides [puppet] - https://gerrit.wikimedia.org/r/841577
[22:32:09] <icinga-wm>	 RECOVERY - SSH on mw1319.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:00:09] <wikibugs>	 SRE, GitLab, Infrastructure-Foundations, CAS-SSO: migrate gitlab away from the CAS protocol - https://phabricator.wikimedia.org/T320390 (bd808)
[23:22:15] <wikibugs>	 (CR) Cwhite: [C: +1] "LGTM! PCC noop: https://puppet-compiler.wmflabs.org/pcc-worker1003/37509/" [puppet] - https://gerrit.wikimedia.org/r/838833 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)