[00:01:00] <icinga-wm>	 RECOVERY - Check systemd state on centrallog1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:14] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:13:34] <wikibugs>	 SRE: Allow Wikimedia Maps usage on desciclopedia.org - https://phabricator.wikimedia.org/T310761 (ZnashBR)
[00:14:48] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s7 on dbstore1003 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 43203.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:20:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[00:21:20] <icinga-wm>	 PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:25:08] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:30:32] <icinga-wm>	 RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:26] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:16] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[00:43:20] <icinga-wm>	 RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:47:02] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:55:48] <icinga-wm>	 PROBLEM - SSH on mw1321.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:05:41] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:07:20] <icinga-wm>	 RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:24:32] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[01:26:14] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:10] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[02:23:11] <wikibugs>	 (CR) DannyS712: CommonSettings: clean up and simplify some code (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/805433 (owner: DannyS712)
[02:41:06] <icinga-wm>	 PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:45:50] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[02:58:22] <icinga-wm>	 RECOVERY - SSH on mw1321.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:02:24] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:11:55] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[03:13:52] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:34:42] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:46:18] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:20:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:21:12] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:32:52] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:36:16] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[04:43:38] <icinga-wm>	 RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:02:57] <wikibugs>	 SRE: Allow Wikimedia Maps usage on desciclopedia.org - https://phabricator.wikimedia.org/T310761 (Aklapper) @ZnashBR: Hi and welcome! Can you please elaborate on the Wikimedia Affiliate supporting project and who you have been in contact with?
[05:05:41] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:09:30] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:11:48] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:22:36] <icinga-wm>	 PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 57223 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[05:29:32] <wikibugs>	 (PS3) KartikMistry: testwiki: Enable SectionTranslation for 11 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/805370 (https://phabricator.wikimedia.org/T309384)
[05:32:51] <wikibugs>	 (PS1) Ayounsi: Rename cloudstore to clouddump [homer/public] - https://gerrit.wikimedia.org/r/806026 (https://phabricator.wikimedia.org/T302981)
[05:35:41] <wikibugs>	 (CR) Ayounsi: [C: +2] Rename cloudstore to clouddump [homer/public] - https://gerrit.wikimedia.org/r/806026 (https://phabricator.wikimedia.org/T302981) (owner: Ayounsi)
[05:39:59] <wikibugs>	 (PS1) Ayounsi: Add cloudstore with clouddumps [homer/public] - https://gerrit.wikimedia.org/r/806067 (https://phabricator.wikimedia.org/T302981)
[05:40:57] <wikibugs>	 (CR) Ayounsi: [C: +2] Add cloudstore with clouddumps [homer/public] - https://gerrit.wikimedia.org/r/806067 (https://phabricator.wikimedia.org/T302981) (owner: Ayounsi)
[05:53:24] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: Your horoscope predicts another unfortunate Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220616T0600).
[06:09:30] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[06:10:02] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:12:56] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:13:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:15:38] <wikibugs>	 (PS1) Thiemo Kreuz (WMDE): Fix unsupported $wgLogos default configurations [mediawiki-config] - https://gerrit.wikimedia.org/r/806068 (https://phabricator.wikimedia.org/T310767)
[06:18:30] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[06:26:10] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): [C: +1] CommonSettings: clean up and simplify some code (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/805433 (owner: DannyS712)
[06:32:41] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): phpcs: move SpaceBeforeSingleLineComment.NewLineComment exclusions (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/805432 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712)
[06:34:41] <wikibugs>	 (CR) DannyS712: phpcs: move SpaceBeforeSingleLineComment.NewLineComment exclusions (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/805432 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712)
[06:49:07] <joal>	 !log Rerun webrequest-load-wf-upload-2022-6-15-22 after weird oozie failure 
[06:49:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:04] <jouncebot>	 Amir1 and apergos: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220616T0700).
[07:00:04] <jouncebot>	 kart_ and TheresNoTime: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:09] <apergos>	 good morning! 
[07:00:16] <TheresNoTime>	 apergos: morning! :D
[07:00:22] <apergos>	 we have a trainee signed up although they have not arrived at the gmeet yet
[07:00:39] <apergos>	 kart_: I imagine you would self deploy, but can we coordinate a bit?
[07:01:04] <apergos>	 I'd like to have you screen share the deployment steps while I talk through the process, if you can work with that
[07:01:53] <apergos>	 (in the meantime our trainee did just join the gmeet so that's all good)
[07:02:21] * kart_ is here.
[07:02:35] <apergos>	 see my question to you
[07:03:22] <kart_>	 Message me GMeet link, I can join.
[07:03:28] <apergos>	 ok!
[07:07:58] <wikibugs>	 (CR) KartikMistry: [C: +2] "UTC morning backport deployment." [mediawiki-config] - https://gerrit.wikimedia.org/r/805370 (https://phabricator.wikimedia.org/T309384) (owner: KartikMistry)
[07:08:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:08:49] <wikibugs>	 (Merged) jenkins-bot: testwiki: Enable SectionTranslation for 11 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/805370 (https://phabricator.wikimedia.org/T309384) (owner: KartikMistry)
[07:11:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:11:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:12:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:12:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:13:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:31] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): phpcs: move SpaceBeforeSingleLineComment.NewLineComment exclusions (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/805432 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712)
[07:18:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:18:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:08] <wikibugs>	 (PS1) Slyngshede: Fix LDAP / Puppet mismatch for cmyrick [puppet] - https://gerrit.wikimedia.org/r/806071
[07:22:24] <logmsgbot>	 !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:805370|testwiki: Enable SectionTranslation for 11 Wikipedias (T309384 T310116)]] (duration: 03m 41s)
[07:22:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:29] <stashbot>	 T310116: Enable Section Translation in Uzbek Wikipedia - https://phabricator.wikimedia.org/T310116
[07:22:29] <stashbot>	 T309384: Enable Content and Section translation on wikipedias with new MT support from Flores - https://phabricator.wikimedia.org/T309384
[07:23:23] <wikibugs>	 (CR) Jcrespo: "Should probably reference the admin: module on commit topic and Bug:T310524 on the second to last line for better searchability/context?" [puppet] - https://gerrit.wikimedia.org/r/806071 (owner: Slyngshede)
[07:24:08] <wikibugs>	 (CR) Slyngshede: [V: +1] "A little cleanup, using the logging built into the system::timer module." [puppet] - https://gerrit.wikimedia.org/r/805829 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:24:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:24:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:24:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:28:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:37] <wikibugs>	 (PS2) Slyngshede: M:admin/data/data.yaml Fix LDAP / Puppet mismatch for cmyrick [puppet] - https://gerrit.wikimedia.org/r/806071 (https://phabricator.wikimedia.org/T310524)
[07:30:27] <kart_>	 I'm done with deployment @apergos 
[07:30:32] <apergos>	 awesome!
[07:30:59] <apergos>	 if anyone else has a patch and would like to self deploy, now's the time. otherwise I'll wander off in a few minutes
[07:33:44] <TheresNoTime>	 thank you both! :)
[07:34:28] <apergos>	 thank you for showing up and thanks kart_ for being 1/2 of the training as well as deploying!
[07:36:18] <kart_>	 Thank you for joining :)
[07:37:19] <TheresNoTime>	 apergos: should I resolve T305191 or leave it open? I'll be joining the next (few) training sessions regardless :)
[07:37:20] <stashbot>	 T305191: Deployment training request for TheresNoTime - https://phabricator.wikimedia.org/T305191
[07:38:06] <apergos>	 TheresNoTime: just mark that you did it and let Tyler close I think
[07:38:35] <apergos>	 you can (and we like it if you) come to many more trainings, regardless of the task being closed. and then eventually...
[07:38:43] <apergos>	 after you've been deploying for a while...
[07:38:53] <apergos>	 you start helping to give these trainings :-)  
[07:39:01] <apergos>	 (our secret plan is now revealed!)
[07:40:10] <wikibugs>	 SRE, DNS, Traffic, WMF-Legal, and 2 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (SLyngshede-WMF) p:Triage→Medium
[07:42:40] <wikibugs>	 SRE, Maps: Allow Wikimedia Maps usage on desciclopedia.org - https://phabricator.wikimedia.org/T310761 (SLyngshede-WMF) p:Triage→High
[07:44:52] <wikibugs>	 (CR) Jcrespo: "Thank you. Note just "admin:" before the subject should be enough (the "module" name) 0:-). See the example at: https://www.mediawiki.org/" [puppet] - https://gerrit.wikimedia.org/r/806071 (https://phabricator.wikimedia.org/T310524) (owner: Slyngshede)
[07:46:33] <TheresNoTime>	 apergos: \o/ you were very good at the training fwiw, been doing it a while?
[07:47:13] <apergos>	 I have, I used to do other sorts of trainings in the organizer/activist realm and so I know some things about doing trainings from that experience :-)  thanks!
[07:47:26] <icinga-wm>	 PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:47:50] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:48:05] <wikibugs>	 (CR) Volans: sre.hosts.pxe: Cookbook to configure dhcp option82 and reboot into pxe (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/792251 (owner: Jbond)
[07:50:56] <icinga-wm>	 PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:53:10] <icinga-wm>	 RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:58:50] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] Enforce alert names with no spaces [alerts] - https://gerrit.wikimedia.org/r/805393 (owner: Filippo Giunchedi)
[07:58:56] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:03:51] <wikibugs>	 (CR) JMeybohm: [C: +1] mediawiki chart 0.2.3: Add before-hook-creation hook-delete-policy [deployment-charts] - https://gerrit.wikimedia.org/r/803597 (owner: Ahmon Dancy)
[08:11:58] <wikibugs>	 (CR) Tacsipacsi: CommonSettings: clean up and simplify some code (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/805433 (owner: DannyS712)
[08:12:03] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/806071 (https://phabricator.wikimedia.org/T310524) (owner: Slyngshede)
[08:13:05] <wikibugs>	 (PS3) Muehlenhoff: Retire profile::logster_alarm [puppet] - https://gerrit.wikimedia.org/r/805734
[08:20:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:22:11] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Retire profile::logster_alarm [puppet] - https://gerrit.wikimedia.org/r/805734 (owner: Muehlenhoff)
[08:23:26] <wikibugs>	 (CR) Slyngshede: [C: +2] M:admin/data/data.yaml Fix LDAP / Puppet mismatch for cmyrick [puppet] - https://gerrit.wikimedia.org/r/806071 (https://phabricator.wikimedia.org/T310524) (owner: Slyngshede)
[08:24:00] <wikibugs>	 (CR) David Caro: Create REST api service to manage toolforge replica.my.cnf (1 comment) [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[08:24:50] <wikibugs>	 (CR) Awight: "(I think this has the wrong bug number)" [puppet] - https://gerrit.wikimedia.org/r/805921 (https://phabricator.wikimedia.org/T301760) (owner: Cwhite)
[08:25:56] <icinga-wm>	 PROBLEM - Check systemd state on ml-cache1001 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:26:40] <elukey>	 this is wip --^
[08:33:05] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, very minor nits inline." [cookbooks] - https://gerrit.wikimedia.org/r/802811 (owner: JMeybohm)
[08:35:52] <wikibugs>	 (CR) Volans: [C: -1] "LGTM, thanks for the patch. Just one safety check to add." [cookbooks] - https://gerrit.wikimedia.org/r/805807 (owner: Jbond)
[08:36:16] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[08:36:54] <wikibugs>	 (CR) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf (7 comments) [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[08:36:56] <wikibugs>	 (PS1) Filippo Giunchedi: swift: drop REPLICATE 'access log' from container-server [puppet] - https://gerrit.wikimedia.org/r/806166 (https://phabricator.wikimedia.org/T309171)
[08:39:22] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM, modulo what Awight said" [puppet] - https://gerrit.wikimedia.org/r/805921 (https://phabricator.wikimedia.org/T301760) (owner: Cwhite)
[08:39:27] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] logstash: add test2 partition to ecs-test policy [puppet] - https://gerrit.wikimedia.org/r/805921 (https://phabricator.wikimedia.org/T301760) (owner: Cwhite)
[08:40:13] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, once the pre-requisite patches have been merged feel free to start testing it." [cookbooks] - https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: JMeybohm)
[08:45:19] <moritzm>	 !log failover ganeti master in drmrs/2 to ganeti6004
[08:45:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:27] <wikibugs>	 (CR) Filippo Giunchedi: [C: -1] "See inline, won't work as-is" [puppet] - https://gerrit.wikimedia.org/r/805889 (https://phabricator.wikimedia.org/T243928) (owner: Ayounsi)
[08:48:05] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] am: use SafeLoader for team regexes [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/805383 (owner: Filippo Giunchedi)
[08:48:10] <wikibugs>	 (CR) Filippo Giunchedi: [V: +2 C: +2] am: use SafeLoader for team regexes [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/805383 (owner: Filippo Giunchedi)
[08:48:43] <wikibugs>	 (PS5) Slyngshede: Ganeti Prometheus exporter, initial checkin [debs/prometheus-ganeti-exporter] - https://gerrit.wikimedia.org/r/804276
[08:49:08] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti6002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[08:49:51] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716) (owner: Majavah)
[08:49:52] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:50:25] <godog>	 taavi: I'll merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/802104 ok ?
[08:51:31] <taavi>	 sure!
[08:51:36] <taavi>	 godog: ^
[08:52:09] <icinga-wm>	 PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:52:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6002.drmrs.wmnet
[08:52:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:00] <godog>	 ack
[08:53:03] <wikibugs>	 (PS2) Filippo Giunchedi: prometheus: use hostname for blackbox::check::http [puppet] - https://gerrit.wikimedia.org/r/805816 (https://phabricator.wikimedia.org/T305847)
[08:53:05] <wikibugs>	 (PS3) Filippo Giunchedi: icinga: check commons.w.o with blackbox exporter [puppet] - https://gerrit.wikimedia.org/r/804274 (https://phabricator.wikimedia.org/T305847)
[08:53:07] <wikibugs>	 (PS2) Filippo Giunchedi: WIP irc check via blackbox [puppet] - https://gerrit.wikimedia.org/r/805815
[08:53:09] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] P:(toolforge|wmcs::paws)::prometheus: configure alertmanager endpoint [puppet] - https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716) (owner: Majavah)
[08:53:22] <godog>	 {{done}}
[08:53:26] <godog>	 taavi: ^
[08:54:07] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: use hostname for blackbox::check::http (1 comment) [puppet] - https://gerrit.wikimedia.org/r/805816 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[08:54:21] <taavi>	 thanks
[08:55:45] <godog>	 sure np
[08:56:42] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:59:24] <wikibugs>	 (PS1) Elukey: Add stub cassandra tls secrets for the ml-cache cluster [labs/private] - https://gerrit.wikimedia.org/r/806167 (https://phabricator.wikimedia.org/T302232)
[08:59:26] <wikibugs>	 (CR) Slyngshede: [C: +2] class:apt Add new private repo. [puppet] - https://gerrit.wikimedia.org/r/803512 (owner: Slyngshede)
[09:00:56] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:02:45] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti6002.drmrs.wmnet
[09:02:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:00] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti6002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[09:05:42] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[09:09:12] <wikibugs>	 (CR) Jbond: "Seems fine but we should clean up the exports vhost at the same time" [puppet] - https://gerrit.wikimedia.org/r/805888 (https://phabricator.wikimedia.org/T243928) (owner: Ayounsi)
[09:11:02] <wikibugs>	 (PS2) Elukey: Add stub cassandra tls secrets for the ml-cache cluster [labs/private] - https://gerrit.wikimedia.org/r/806167 (https://phabricator.wikimedia.org/T302232)
[09:11:25] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[09:11:27] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[09:11:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T302659)', diff saved to https://phabricator.wikimedia.org/P29868 and previous config saved to /var/cache/conftool/dbconfig/20220616-091131-marostegui.json
[09:11:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:37] <stashbot>	 T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659
[09:12:03] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] Add stub cassandra tls secrets for the ml-cache cluster [labs/private] - 10https://gerrit.wikimedia.org/r/806167 (https://phabricator.wikimedia.org/T302232) (owner: 10Elukey)
[09:14:48] <wikibugs>	 (03PS8) 10Slyngshede: WIP: profile::aptrepo::wikimedia test public apt repo on Apache [puppet] - 10https://gerrit.wikimedia.org/r/803506
[09:15:02] <wikibugs>	 (03PS1) 10Elukey: role::ml_cache::storage: add TLS settings for Cassandra [puppet] - 10https://gerrit.wikimedia.org/r/806168 (https://phabricator.wikimedia.org/T302232)
[09:15:55] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35884/console" [puppet] - 10https://gerrit.wikimedia.org/r/806168 (https://phabricator.wikimedia.org/T302232) (owner: 10Elukey)
[09:15:59] <icinga-wm>	 PROBLEM - Disk space on centrallog2002 is CRITICAL: DISK CRITICAL - free space: /srv 61174 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[09:16:29] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] role::ml_cache::storage: add TLS settings for Cassandra [puppet] - 10https://gerrit.wikimedia.org/r/806168 (https://phabricator.wikimedia.org/T302232) (owner: 10Elukey)
[09:16:32] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:17:15] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm)
[09:18:22] <icinga-wm>	 RECOVERY - cassandra-a service on ml-cache1001 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:19:32] <wikibugs>	 (03PS2) 10Jbond: SREBaseClass: Allow overriding actions [cookbooks] - 10https://gerrit.wikimedia.org/r/805807
[09:21:31] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache1002.eqiad.wmnet with OS buster
[09:21:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:08] <icinga-wm>	 PROBLEM - cassandra-a service on ml-cache1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:24:31] <wikibugs>	 (03CR) 10Ayounsi: [V: 03+1] Prometheus: scrap Netbox django metrics (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/805889 (https://phabricator.wikimedia.org/T243928) (owner: 10Ayounsi)
[09:26:40] <wikibugs>	 (03PS10) 10JMeybohm: Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811
[09:26:42] <wikibugs>	 (03PS25) 10JMeybohm: Add a cookbook for rolling reboot of k8s clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661)
[09:27:47] <wikibugs>	 (03CR) 10JMeybohm: Make SREBatchBase operate on host groups (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm)
[09:27:58] <wikibugs>	 (03CR) 10JMeybohm: Add a cookbook for rolling reboot of k8s clusters (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[09:29:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add a cookbook for rolling reboot of k8s clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[09:30:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm)
[09:30:54] * jayme ❤️ pylint
[09:30:58] <wikibugs>	 (03PS1) 10Jbond: SREBatchBase: Make action method a bit more dynamic [cookbooks] - 10https://gerrit.wikimedia.org/r/806170
[09:30:58] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.130.9:7001 on ml-cache1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[09:31:51] <wikibugs>	 (03CR) 10Jbond: SREBaseClass: Allow overriding actions (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/805807 (owner: 10Jbond)
[09:32:04] <wikibugs>	 (03PS11) 10JMeybohm: Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811
[09:32:06] <wikibugs>	 (03PS26) 10JMeybohm: Add a cookbook for rolling reboot of k8s clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661)
[09:32:08] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 04-1] Prometheus: scrap Netbox django metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/805889 (https://phabricator.wikimedia.org/T243928) (owner: 10Ayounsi)
[09:32:28] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache1003.eqiad.wmnet with OS buster
[09:32:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:22] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache1002.eqiad.wmnet with reason: host reimage
[09:33:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:56] <wikibugs>	 (03PS2) 10JMeybohm: Align cumin aliases for wikikube clusters [puppet] - 10https://gerrit.wikimedia.org/r/790662 (https://phabricator.wikimedia.org/T260661)
[09:36:00] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-cache1002.eqiad.wmnet with reason: host reimage
[09:36:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:04] <wikibugs>	 (03CR) 10Jbond: "JKuyst noticed i forgot to" [cookbooks] - 10https://gerrit.wikimedia.org/r/792251 (owner: 10Jbond)
[09:36:34] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[09:37:42] <wikibugs>	 (03PS53) 10David Caro: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[09:39:32] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:39:37] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[09:40:57] <wikibugs>	 (03PS2) 10Ayounsi: Netbox: expose Netbox on the frontend's FQDN [puppet] - 10https://gerrit.wikimedia.org/r/805888 (https://phabricator.wikimedia.org/T243928)
[09:40:59] <wikibugs>	 (03PS2) 10Ayounsi: Prometheus: gently pull Netbox django metrics [puppet] - 10https://gerrit.wikimedia.org/r/805889 (https://phabricator.wikimedia.org/T243928)
[09:41:33] <wikibugs>	 (03CR) 10Ayounsi: Netbox: expose Netbox on the frontend's FQDN (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/805888 (https://phabricator.wikimedia.org/T243928) (owner: 10Ayounsi)
[09:41:44] <wikibugs>	 (03CR) 10Ayounsi: Prometheus: gently pull Netbox django metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/805889 (https://phabricator.wikimedia.org/T243928) (owner: 10Ayounsi)
[09:42:52] <wikibugs>	 (03CR) 10JMeybohm: Add a cookbook for rolling reboot of k8s clusters (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[09:44:41] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache1003.eqiad.wmnet with reason: host reimage
[09:44:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:57] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-cache1003.eqiad.wmnet with reason: host reimage
[09:47:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:13] <icinga-wm>	 RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:49:41] <icinga-wm>	 PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:56:59] <godog>	 jbond: re: the phab p a g e yesterday, that's icinga not prometheus that pages
[09:57:16] <jbond>	 godog: ack thanks i noticed but too late :)
[09:57:47] <godog>	 hehe! I'm looking at the phab probes now though, definitely better with the hostname
[09:57:53] <godog>	 i.e. https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes%2Fcustom&var-module=http&orgId=1&from=now-3h&to=now&var-site=All
[09:58:01] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] "Seems good to me, thanks; once we're done with the bullseye upgrade, might be worth seeing if swift has some knobs to twiddle to make it a bi" [puppet] - 10https://gerrit.wikimedia.org/r/806166 (https://phabricator.wikimedia.org/T309171) (owner: 10Filippo Giunchedi)
[09:58:27] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] swift: drop REPLICATE 'access log' from container-server [puppet] - 10https://gerrit.wikimedia.org/r/806166 (https://phabricator.wikimedia.org/T309171) (owner: 10Filippo Giunchedi)
[09:58:36] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] "Cheers Matthew, merging" [puppet] - 10https://gerrit.wikimedia.org/r/806166 (https://phabricator.wikimedia.org/T309171) (owner: 10Filippo Giunchedi)
[10:00:05] <jouncebot>	 mvolz: My dear minions, it's time we take the moon! Just kidding. Time for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220616T1000).
[10:02:36] <elukey>	 !log ran `scap install-world --batch` on deploy1002 to allow scap/puppet to work on ml-cache100[2,3]
[10:02:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:08] <jbond>	 godog: definitely, the delay was me wondering and checking why we only had phab1001, but it's the only one with monitoring configured
[10:06:29] <wikibugs>	 (03CR) 10Jbond: "LGTM just minor nit" [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh)
[10:06:49] <icinga-wm>	 RECOVERY - cassandra-a service on ml-cache1001 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:07:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/805448 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh)
[10:08:41] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache1003.eqiad.wmnet with OS buster
[10:08:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:52] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/805888 (https://phabricator.wikimedia.org/T243928) (owner: 10Ayounsi)
[10:11:30] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache1002.eqiad.wmnet with OS buster
[10:11:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:31] <icinga-wm>	 PROBLEM - cassandra-a service on ml-cache1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:16:21] <icinga-wm>	 RECOVERY - Check systemd state on ml-cache1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:16:35] <wikibugs>	 (03PS1) 10Filippo Giunchedi: swift: introduce rsyslog config to ban logs before centrallog [puppet] - 10https://gerrit.wikimedia.org/r/806173 (https://phabricator.wikimedia.org/T309171)
[10:16:41] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.130.9:7001 on ml-cache1001 is OK: SSL OK - Certificate ml-cache1001-a valid until 2024-06-15 08:50:14 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[10:16:45] <icinga-wm>	 RECOVERY - cassandra-a service on ml-cache1001 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:17:29] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] swift: introduce rsyslog config to ban logs before centrallog [puppet] - 10https://gerrit.wikimedia.org/r/806173 (https://phabricator.wikimedia.org/T309171) (owner: 10Filippo Giunchedi)
[10:17:33] <godog>	 jbond: yeah I went with the existing guard for the active host only, though that should be revisited IMHO (in a future iteration)
[10:18:25] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s7 on dbstore1003 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:18:45] <wikibugs>	 (03PS2) 10Filippo Giunchedi: swift: introduce rsyslog config to ban logs before centrallog [puppet] - 10https://gerrit.wikimedia.org/r/806173 (https://phabricator.wikimedia.org/T309171)
[10:18:45] <icinga-wm>	 PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:18:56] <jbond>	 godog: agree
[10:20:05] <wikibugs>	 (03PS1) 10Muehlenhoff: cas: Update to 6.5.5 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/806174 (https://phabricator.wikimedia.org/T305518)
[10:21:33] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on ml-serve-ctrl1001.eqiad.wmnet with reason: Rebooting to activate new kernel for T310483?
[10:21:35] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on ml-serve-ctrl1001.eqiad.wmnet with reason: Rebooting to activate new kernel for T310483?
[10:21:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:34] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35885/console" [puppet] - 10https://gerrit.wikimedia.org/r/806173 (https://phabricator.wikimedia.org/T309171) (owner: 10Filippo Giunchedi)
[10:23:01] <godog>	 hah, phab on ipv6 is failing because envoy isn't listening on :443 on ipv6
[10:23:05] <godog>	 "fair enough"
[10:24:35] <godog>	 ok gotta go to lunch!
[10:24:57] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:25:03] <wikibugs>	 (03PS1) 10Muehlenhoff: Bump changelog for 6.5.5 and add some docs how to resync the overlay [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/806175 (https://phabricator.wikimedia.org/T305518)
[10:25:57] <icinga-wm>	 RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 121, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:28:20] <elukey>	 -7
[10:28:22] <elukey>	 uff
[10:28:53] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on ml-serve-ctrl1002.eqiad.wmnet with reason: Rebooting to activate new kernel for T310483
[10:28:54] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on ml-serve-ctrl1002.eqiad.wmnet with reason: Rebooting to activate new kernel for T310483
[10:28:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:13] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.130.9:9042 on ml-cache1001 is OK: TCP OK - 0.000 second response time on 10.64.130.9 port 9042 https://phabricator.wikimedia.org/T93886
[10:31:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T302659)', diff saved to https://phabricator.wikimedia.org/P29869 and previous config saved to /var/cache/conftool/dbconfig/20220616-103117-marostegui.json
[10:31:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host elastic1089.eqiad.wmnet
[10:31:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:21] <stashbot>	 T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659
[10:31:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:57] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:33:17] <icinga-wm>	 RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 121, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:34:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow6001.drmrs.wmnet
[10:34:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host elastic1089.eqiad.wmnet
[10:35:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on 10 hosts with reason: reboots
[10:36:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 10 hosts with reason: reboots
[10:36:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow6001.drmrs.wmnet
[10:37:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:11] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/805807 (owner: 10Jbond)
[10:41:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow5002.eqsin.wmnet
[10:41:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:44] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "I like it, see small nit inline for the naming, and yes might require some additional changes elsewhere." [cookbooks] - 10https://gerrit.wikimedia.org/r/806170 (owner: 10Jbond)
[10:41:46] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] SREBaseClass: Allow overriding actions [cookbooks] - 10https://gerrit.wikimedia.org/r/805807 (owner: 10Jbond)
[10:44:05] <wikibugs>	 (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/806173 (https://phabricator.wikimedia.org/T309171) (owner: 10Filippo Giunchedi)
[10:45:11] <icinga-wm>	 RECOVERY - Check systemd state on netflow5002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:45:17] <wikibugs>	 (03Merged) 10jenkins-bot: SREBaseClass: Allow overriding actions [cookbooks] - 10https://gerrit.wikimedia.org/r/805807 (owner: 10Jbond)
[10:45:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow5002.eqsin.wmnet
[10:45:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow4002.ulsfo.wmnet
[10:45:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P29870 and previous config saved to /var/cache/conftool/dbconfig/20220616-104622-marostegui.json
[10:46:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on elastic[1100-1102].eqiad.wmnet with reason: reboots
[10:46:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on elastic[1100-1102].eqiad.wmnet with reason: reboots
[10:46:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:50] <wikibugs>	 (03PS1) 10Volans: sre.cdn.roll-restart-varnish: simplify code [cookbooks] - 10https://gerrit.wikimedia.org/r/806177
[10:49:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow4002.ulsfo.wmnet
[10:49:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:33] <icinga-wm>	 RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:53:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow3002.esams.wmnet
[10:53:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:01] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1001.eqiad.wmnet
[10:54:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:43] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:56:00] <wikibugs>	 (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/792251 (owner: 10Jbond)
[10:56:46] <wikibugs>	 (03PS2) 10Jbond: SREBatchBase: Make action method a bit more dynamic [cookbooks] - 10https://gerrit.wikimedia.org/r/806170
[10:57:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow3002.esams.wmnet
[10:57:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:50] <wikibugs>	 (03CR) 10Jbond: "thanks updated" [cookbooks] - 10https://gerrit.wikimedia.org/r/806170 (owner: 10Jbond)
[10:58:26] <wikibugs>	 (03CR) 10Jbond: sre.hosts.pxe: Cookbook to configure dhcp option82 and reboot into pxe (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/792251 (owner: 10Jbond)
[11:00:25] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[11:00:34] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1001.eqiad.wmnet
[11:00:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P29871 and previous config saved to /var/cache/conftool/dbconfig/20220616-110127-marostegui.json
[11:01:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:22] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1002.eqiad.wmnet
[11:02:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:50] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:07:03] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:07:29] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1002.eqiad.wmnet
[11:07:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:43] <icinga-wm>	 RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 121, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:09:05] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1003.eqiad.wmnet
[11:09:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:25] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:10:29] <wikibugs>	 (03PS9) 10Slyngshede: profile::aptrepo::wikimedia test public apt repo on Apache [puppet] - 10https://gerrit.wikimedia.org/r/803506
[11:12:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow2002.codfw.wmnet
[11:12:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:05] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:15:04] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35886/console" [puppet] - 10https://gerrit.wikimedia.org/r/803506 (owner: 10Slyngshede)
[11:15:50] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, see comment inline too." [cookbooks] - 10https://gerrit.wikimedia.org/r/806170 (owner: 10Jbond)
[11:16:05] <icinga-wm>	 RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 121, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:16:21] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1003.eqiad.wmnet
[11:16:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T302659)', diff saved to https://phabricator.wikimedia.org/P29873 and previous config saved to /var/cache/conftool/dbconfig/20220616-111632-marostegui.json
[11:16:34] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[11:16:36] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[11:16:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:38] <stashbot>	 T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659
[11:16:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow2002.codfw.wmnet
[11:17:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:31] <klausman>	 Sorry about the BGP error noise. I'm rebooting ml k8s nodes for new kernels and that triggers it. I could put in a silence, but it seems there are other BGP alerts that I might step on
[11:18:38] <wikibugs>	 (03PS3) 10Filippo Giunchedi: swift: introduce rsyslog config to ban logs before centrallog [puppet] - 10https://gerrit.wikimedia.org/r/806173 (https://phabricator.wikimedia.org/T309171)
[11:18:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: swift: introduce rsyslog config to ban logs before centrallog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806173 (https://phabricator.wikimedia.org/T309171) (owner: 10Filippo Giunchedi)
[11:19:04] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1004.eqiad.wmnet
[11:19:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:59] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:20:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow1002.eqiad.wmnet
[11:20:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:55] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] "LGTM, thanks :-)" [puppet] - 10https://gerrit.wikimedia.org/r/806173 (https://phabricator.wikimedia.org/T309171) (owner: 10Filippo Giunchedi)
[11:22:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow1002.eqiad.wmnet
[11:22:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you for the quick review!" [puppet] - 10https://gerrit.wikimedia.org/r/806173 (https://phabricator.wikimedia.org/T309171) (owner: 10Filippo Giunchedi)
[11:25:26] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1004.eqiad.wmnet
[11:25:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:27] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1005.eqiad.wmnet
[11:27:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:01] <wikibugs>	 (03PS1) 10Jbond: DO NOT MERGE: try to reproduce an issue [cookbooks] - 10https://gerrit.wikimedia.org/r/806195
[11:31:18] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove rsync config only needed for stretch->bullseye migration [puppet] - 10https://gerrit.wikimedia.org/r/804339 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff)
[11:32:07] <wikibugs>	 (03PS3) 10Muehlenhoff: Enable ganeti4004 as Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/792670
[11:32:39] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] DO NOT MERGE: try to reproduce an issue [cookbooks] - 10https://gerrit.wikimedia.org/r/806195 (owner: 10Jbond)
[11:32:59] <wikibugs>	 (03PS2) 10Jbond: DO NOT MERGE: try to reproduce an issue [cookbooks] - 10https://gerrit.wikimedia.org/r/806195
[11:33:10] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1005.eqiad.wmnet
[11:33:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on testvm[2001-2005].codfw.wmnet with reason: reboots
[11:34:45] <wikibugs>	 (03PS3) 10Jbond: DO NOT MERGE: try to reproduce an issue [cookbooks] - 10https://gerrit.wikimedia.org/r/806195
[11:34:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on testvm[2001-2005].codfw.wmnet with reason: reboots
[11:34:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:57] <wikibugs>	 (03CR) 10Jbond: [V: 04-1 C: 04-1] DO NOT MERGE: try to reproduce an issue [cookbooks] - 10https://gerrit.wikimedia.org/r/806195 (owner: 10Jbond)
[11:35:05] <wikibugs>	 (03CR) 10Jbond: [V: 04-1 C: 04-2] DO NOT MERGE: try to reproduce an issue [cookbooks] - 10https://gerrit.wikimedia.org/r/806195 (owner: 10Jbond)
[11:35:21] <godog>	 !log trim swift logs older than 25d from centrallog hosts - T309171
[11:35:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:25] <stashbot>	 T309171: syslog / centrallog log volume growth - https://phabricator.wikimedia.org/T309171
[11:36:34] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:36:43] <icinga-wm>	 PROBLEM - BFD status on cr1-drmrs is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:37:02] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:37:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] DO NOT MERGE: try to reproduce an issue [cookbooks] - 10https://gerrit.wikimedia.org/r/806195 (owner: 10Jbond)
[11:37:58] <wikibugs>	 (03PS10) 10Hnowlan: cassandra: load grants files upon change [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897)
[11:38:17] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1006.eqiad.wmnet
[11:38:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:50] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:40:00] <icinga-wm>	 RECOVERY - BFD status on cr1-drmrs is OK: OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:40:18] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:41:00] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cassandra: load grants files upon change [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: 10Hnowlan)
[11:43:46] <wikibugs>	 (03PS18) 10Btullis: Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246)
[11:43:55] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "We also need to keep everything in src/main/resources (theses are our skinning customisations)" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/806174 (https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff)
[11:44:36] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1006.eqiad.wmnet
[11:44:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:18] <wikibugs>	 (03PS11) 10Hnowlan: cassandra: load grants files upon change [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897)
[11:45:29] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1007.eqiad.wmnet
[11:45:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:55] <wikibugs>	 (03CR) 10Muehlenhoff: cas: Update to 6.5.5 (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/806174 (https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff)
[11:48:19] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] cas: Update to 6.5.5 (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/806174 (https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff)
[11:50:18] <icinga-wm>	 RECOVERY - Disk space on centrallog2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog2002&var-datasource=codfw+prometheus/ops
[11:50:20] <wikibugs>	 (03CR) 10Jbond: Bump changelog for 6.5.5 and add some docs how to resync the overlay (032 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/806175 (https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff)
[11:50:22] <icinga-wm>	 PROBLEM - BGP status on lsw1-f2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:51:26] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/806177 (owner: 10Volans)
[11:52:07] <wikibugs>	 (03CR) 10Hnowlan: cassandra: load grants files upon change (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: 10Hnowlan)
[11:52:28] <wikibugs>	 (03PS1) 10Slyngshede: P:apt do not include private apt repo on cloud hosts. [puppet] - 10https://gerrit.wikimedia.org/r/806197
[11:52:30] <icinga-wm>	 RECOVERY - BGP status on lsw1-f2-eqiad.mgmt is OK: BGP OK - up: 4, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:53:03] <wikibugs>	 (03CR) 10Btullis: "I'd be grateful for a review of this please. The idea is to be able to have real-time information in Prometheus about which servers are su" [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis)
[11:53:11] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1007.eqiad.wmnet
[11:53:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:39] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-serve1008.eqiad.wmnet
[11:53:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:26] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35887/console" [puppet] - 10https://gerrit.wikimedia.org/r/806197 (owner: 10Slyngshede)
[11:55:56] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:56:44] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] SREBatchBase: Make action method a bit more dynamic (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/806170 (owner: 10Jbond)
[11:56:59] <wikibugs>	 (03PS3) 10Jbond: SREBatchBase: Make action method a bit more dynamic [cookbooks] - 10https://gerrit.wikimedia.org/r/806170
[11:58:18] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/806197 (owner: 10Slyngshede)
[11:58:44] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:apt do not include private apt repo on cloud hosts. [puppet] - 10https://gerrit.wikimedia.org/r/806197 (owner: 10Slyngshede)
[11:58:51] <wikibugs>	 (03CR) 10Muehlenhoff: P:apt do not include private apt repo on cloud hosts. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806197 (owner: 10Slyngshede)
[11:59:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132 for schema change', diff saved to https://phabricator.wikimedia.org/P29874 and previous config saved to /var/cache/conftool/dbconfig/20220616-115924-root.json
[11:59:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:40] <wikibugs>	 (03PS1) 10Slyngshede: hiera:cloud fix comma [puppet] - 10https://gerrit.wikimedia.org/r/806198
[12:00:59] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/806198 (owner: 10Slyngshede)
[12:01:07] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve1008.eqiad.wmnet
[12:01:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:10] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[12:01:44] <klausman>	 done with reboots for now. Any remaining BGP alerts are, like, for real
[12:01:58] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] hiera:cloud fix comma [puppet] - 10https://gerrit.wikimedia.org/r/806198 (owner: 10Slyngshede)
[12:02:50] <wikibugs>	 (03CR) 10Jbond: [V: 04-1 C: 04-2] DO NOT MERGE: try to reproduce an issue (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/806195 (owner: 10Jbond)
[12:11:05] <wikibugs>	 (03PS1) 10Muehlenhoff: cas: Update to 6.5.5 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/806203 (https://phabricator.wikimedia.org/T305518)
[12:15:18] <wikibugs>	 (03PS1) 10Btullis: Add a new check for the age of the standby namenode fsimage [puppet] - 10https://gerrit.wikimedia.org/r/806205 (https://phabricator.wikimedia.org/T309649)
[12:15:42] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Enable ganeti4004 as Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/792670 (owner: 10Muehlenhoff)
[12:16:00] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "Hm, what effect will this change have? As far as I can tell from WikibaseCirrusSearch code, this doesn’t look like a no-op…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801792 (owner: 10DCausse)
[12:16:12] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:17:23] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35888/console" [puppet] - 10https://gerrit.wikimedia.org/r/806205 (https://phabricator.wikimedia.org/T309649) (owner: 10Btullis)
[12:19:01] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [cirrus] Add a custom profile for the wikibase language selector (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801793 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse)
[12:20:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[12:20:46] <icinga-wm>	 RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:22:26] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.cdn.roll-restart-varnish: simplify code [cookbooks] - 10https://gerrit.wikimedia.org/r/806177 (owner: 10Volans)
[12:24:41] <wikibugs>	 (03PS6) 10Jbond: wmflib::service: add data loader class [puppet] - 10https://gerrit.wikimedia.org/r/799342
[12:24:44] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/806170 (owner: 10Jbond)
[12:25:52] <wikibugs>	 (03Merged) 10jenkins-bot: sre.cdn.roll-restart-varnish: simplify code [cookbooks] - 10https://gerrit.wikimedia.org/r/806177 (owner: 10Volans)
[12:26:39] <wikibugs>	 (03PS4) 10Vlad.shapik: WP:Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719)
[12:27:02] <wikibugs>	 (03PS1) 10Slyngshede: C:apt actively absent privte repo if not requested. [puppet] - 10https://gerrit.wikimedia.org/r/806206
[12:27:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WP:Upgrade thumbor to Thumbor 7 and python3 [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/800170 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik)
[12:28:45] <wikibugs>	 (03PS4) 10Filippo Giunchedi: icinga: check commons.w.o with blackbox exporter [puppet] - 10https://gerrit.wikimedia.org/r/804274 (https://phabricator.wikimedia.org/T305847)
[12:28:47] <wikibugs>	 (03PS3) 10Filippo Giunchedi: WIP irc check via blackbox [puppet] - 10https://gerrit.wikimedia.org/r/805815
[12:28:49] <wikibugs>	 (03PS1) 10Filippo Giunchedi: phabricator: get envoy to listen on ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/806207 (https://phabricator.wikimedia.org/T305847)
[12:29:23] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmflib::service: add data loader class [puppet] - 10https://gerrit.wikimedia.org/r/799342 (owner: 10Jbond)
[12:29:44] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35889/console" [puppet] - 10https://gerrit.wikimedia.org/r/806207 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[12:29:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] C:apt actively absent privte repo if not requested. [puppet] - 10https://gerrit.wikimedia.org/r/806206 (owner: 10Slyngshede)
[12:31:17] <wikibugs>	 (03PS19) 10Jbond: Add initial config for pooled status [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis)
[12:31:53] <wikibugs>	 (03PS2) 10Slyngshede: C:apt actively absent privte repo if not requested. [puppet] - 10https://gerrit.wikimedia.org/r/806206
[12:33:52] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[12:33:53] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[12:33:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T302659)', diff saved to https://phabricator.wikimedia.org/P29875 and previous config saved to /var/cache/conftool/dbconfig/20220616-123357-marostegui.json
[12:34:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:03] <stashbot>	 T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659
[12:36:16] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[12:36:20] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35890/console" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis)
[12:37:13] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35891/console" [puppet] - 10https://gerrit.wikimedia.org/r/806206 (owner: 10Slyngshede)
[12:40:22] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+1] "couple of nits but lgtm, please get a +1 from observability to check the prom file.  Full output of which can be seen in the full diff" [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis)
[12:42:08] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:45:44] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/806206 (owner: 10Slyngshede)
[12:46:20] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:apt actively absent privte repo if not requested. [puppet] - 10https://gerrit.wikimedia.org/r/806206 (owner: 10Slyngshede)
[12:46:43] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Add a new check for the age of the standby namenode fsimage [puppet] - 10https://gerrit.wikimedia.org/r/806205 (https://phabricator.wikimedia.org/T309649) (owner: 10Btullis)
[12:48:39] <wikibugs>	 (03PS12) 10JMeybohm: Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811
[12:48:49] <wikibugs>	 (03PS27) 10JMeybohm: Add a cookbook for rolling reboot of k8s clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661)
[12:52:35] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Align cumin aliases for wikikube clusters [puppet] - 10https://gerrit.wikimedia.org/r/790662 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[12:55:36] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Add a new check for the age of the standby namenode fsimage [puppet] - 10https://gerrit.wikimedia.org/r/806205 (https://phabricator.wikimedia.org/T309649) (owner: 10Btullis)
[12:56:29] <wikibugs>	 10SRE-swift-storage, 10Commons: HTTP 503 Backend fetch failed while editing Commons - https://phabricator.wikimedia.org/T307338 (10Aklapper)
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220616T1300).
[13:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:01:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4004.ulsfo.wmnet
[13:01:02] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1089.eqiad.wmnet,service=ats-be
[13:01:02] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1089.eqiad.wmnet,service=varnish-fe
[13:01:03] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1089.eqiad.wmnet,service=ats-tls
[13:01:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:31] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Add a cookbook for rolling reboot of k8s clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[13:02:34] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm)
[13:04:25] <wikibugs>	 (03PS1) 10Jbond: wmflib::service: Reject empty string values [puppet] - 10https://gerrit.wikimedia.org/r/806208
[13:04:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P29876 and previous config saved to /var/cache/conftool/dbconfig/20220616-130438-root.json
[13:04:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:24] <wikibugs>	 (03Merged) 10jenkins-bot: Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm)
[13:05:26] <wikibugs>	 (03Merged) 10jenkins-bot: Add a cookbook for rolling reboot of k8s clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[13:05:41] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:06:42] <wikibugs>	 (03CR) 10MVernon: "Looks reasonable to me, but I don't feel I know enough to weigh in on whether the validator should be pickier or not." [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: 10Hnowlan)
[13:07:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4004.ulsfo.wmnet
[13:07:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:43] <icinga-wm>	 PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:09:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4004.ulsfo.wmnet to ganeti01.svc.ulsfo.wmnet
[13:09:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:28] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4004.ulsfo.wmnet to ganeti01.svc.ulsfo.wmnet
[13:09:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:36] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host cp1089.eqiad.wmnet
[13:10:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:21] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:14:02] <wikibugs>	 (03PS1) 10JMeybohm: Update misc-clusters/example.txt... [cookbooks] - 10https://gerrit.wikimedia.org/r/806210
[13:15:28] <wikibugs>	 (03CR) 10Ottomata: "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/806205 (https://phabricator.wikimedia.org/T309649) (owner: 10Btullis)
[13:19:18] <wikibugs>	 (03PS1) 10Btullis: Add a sudo_user parameter to the hadoop fsimage freshness check [puppet] - 10https://gerrit.wikimedia.org/r/806212 (https://phabricator.wikimedia.org/T309649)
[13:19:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P29877 and previous config saved to /var/cache/conftool/dbconfig/20220616-131942-root.json
[13:19:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:32] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35893/console" [puppet] - 10https://gerrit.wikimedia.org/r/806212 (https://phabricator.wikimedia.org/T309649) (owner: 10Btullis)
[13:21:47] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.misc-clusters.sretest rolling restart_daemons on A:sretest
[13:21:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:00] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.misc-clusters.sretest (exit_code=0) rolling restart_daemons on A:sretest
[13:22:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:52] <wikibugs>	 (03PS20) 10Btullis: Add a host's confctl pooled status and weight per service to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246)
[13:24:03] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cp1089.eqiad.wmnet
[13:24:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:05] <wikibugs>	 (03PS1) 10Zabe: imagemagick: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/806216 (https://phabricator.wikimedia.org/T308013)
[13:25:07] <wikibugs>	 (03PS1) 10Zabe: php: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/806217 (https://phabricator.wikimedia.org/T308013)
[13:25:09] <wikibugs>	 (03PS1) 10Zabe: spamassassin: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/806218 (https://phabricator.wikimedia.org/T308013)
[13:25:11] <wikibugs>	 (03PS1) 10Zabe: tomcat: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/806219 (https://phabricator.wikimedia.org/T308013)
[13:25:13] <wikibugs>	 (03PS1) 10Zabe: vrts: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/806220 (https://phabricator.wikimedia.org/T308013)
[13:25:45] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Add a sudo_user parameter to the hadoop fsimage freshness check [puppet] - 10https://gerrit.wikimedia.org/r/806212 (https://phabricator.wikimedia.org/T309649) (owner: 10Btullis)
[13:27:14] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:27:38] <wikibugs>	 (03PS2) 10Zabe: vrts: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/806220 (https://phabricator.wikimedia.org/T308013)
[13:30:34] <wikibugs>	 (03CR) 10Itamar Givon: [cirrus] Add a custom profile for the wikibase language selector (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801793 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse)
[13:31:45] <wikibugs>	 (03PS1) 10Volans: MW DB user: update username to wikiuser [puppet] - 10https://gerrit.wikimedia.org/r/806221
[13:33:03] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] MW DB user: update username to wikiuser [puppet] - 10https://gerrit.wikimedia.org/r/806221 (owner: 10Volans)
[13:35:17] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [cirrus] Add a custom profile for the wikibase language selector (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801793 (https://phabricator.wikimedia.org/T307869) (owner: 10DCausse)
[13:35:47] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Change wikiuser user [software] - 10https://gerrit.wikimedia.org/r/806222
[13:36:57] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/806221 (owner: 10Volans)
[13:37:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P29878 and previous config saved to /var/cache/conftool/dbconfig/20220616-133446-root.json
[13:37:58] <wikibugs>	 (03PS21) 10Btullis: Add a host's confctl pooled status and weight per service to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246)
[13:37:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:07] <wikibugs>	 (03PS2) 10Volans: MW DB user: update username to wikiuser202206 [puppet] - 10https://gerrit.wikimedia.org/r/806221
[13:38:23] <wikibugs>	 (03PS2) 10Marostegui: mariadb: Change wikiuser user [software] - 10https://gerrit.wikimedia.org/r/806222
[13:39:32] <icinga-wm>	 PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:40:39] <wikibugs>	 (03PS1) 10Volans: MW DB user: update username to wikiuser202206 [software] - 10https://gerrit.wikimedia.org/r/806223
[13:41:05] <wikibugs>	 (03CR) 10Btullis: Add a host's confctl pooled status and weight per service to prometheus (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis)
[13:41:15] <wikibugs>	 (03PS22) 10Btullis: Add a host's confctl pooled status and weight per service to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246)
[13:44:01] <wikibugs>	 (03CR) 10Jbond: "Sorry missed this one earlier" [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis)
[13:45:50] <sukhe>	 !log upload bird2_2.0.7-4.1wm1 to apt.wm.o (buster) - T310574
[13:45:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:54] <stashbot>	 T310574: Upgrade to Bird 2 - https://phabricator.wikimedia.org/T310574
[13:46:52] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] MW DB user: update username to wikiuser202206 [software] - 10https://gerrit.wikimedia.org/r/806223 (owner: 10Volans)
[13:47:12] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/806210 (owner: 10JMeybohm)
[13:47:45] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Update misc-clusters/example.txt... [cookbooks] - 10https://gerrit.wikimedia.org/r/806210 (owner: 10JMeybohm)
[13:49:37] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM but see comment, no action needed but something to consider as we deploy" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/806203 (https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff)
[13:49:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P29879 and previous config saved to /var/cache/conftool/dbconfig/20220616-134950-root.json
[13:49:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:36] <wikibugs>	 (03PS4) 10Jbond: SREBatchBase: Make action method a bit more dynamic [cookbooks] - 10https://gerrit.wikimedia.org/r/806170
[13:50:47] <wikibugs>	 (03PS5) 10Jbond: SREBatchBase: Make action method a bit more dynamic [cookbooks] - 10https://gerrit.wikimedia.org/r/806170
[13:50:53] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/806170 (owner: 10Jbond)
[13:50:55] <wikibugs>	 (03Merged) 10jenkins-bot: Update misc-clusters/example.txt... [cookbooks] - 10https://gerrit.wikimedia.org/r/806210 (owner: 10JMeybohm)
[13:51:46] <wikibugs>	 (03PS23) 10Btullis: Add a host's confctl pooled status and weight per service to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246)
[13:51:55] <wikibugs>	 (03CR) 10Volans: [C: 03+2] MW DB user: update username to wikiuser202206 [software] - 10https://gerrit.wikimedia.org/r/806223 (owner: 10Volans)
[13:54:00] <wikibugs>	 (03PS1) 10Jbond: log: stop suppressing logging exceptions [software/spicerack] - 10https://gerrit.wikimedia.org/r/806225
[13:54:22] <wikibugs>	 (03Merged) 10jenkins-bot: MW DB user: update username to wikiuser202206 [software] - 10https://gerrit.wikimedia.org/r/806223 (owner: 10Volans)
[13:56:26] <wikibugs>	 (03PS6) 10Jbond: SREBatchBase: Make action method a bit more dynamic [cookbooks] - 10https://gerrit.wikimedia.org/r/806170
[13:56:33] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/806225 (owner: 10Jbond)
[13:56:47] <wikibugs>	 (03PS7) 10Jbond: SREBatchBase: Make action method a bit more dynamic [cookbooks] - 10https://gerrit.wikimedia.org/r/806170
[13:57:07] <wikibugs>	 (03PS8) 10Jbond: SREBatchBase: Make action method a bit more dynamic [cookbooks] - 10https://gerrit.wikimedia.org/r/806170
[13:57:34] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] log: stop suppressing logging exceptions [software/spicerack] - 10https://gerrit.wikimedia.org/r/806225 (owner: 10Jbond)
[13:57:46] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:58:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:58:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:18] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35894/console" [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis)
[13:58:32] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Add a host's confctl pooled status and weight per service to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis)
[13:58:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:58:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:58:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:14] <icinga-wm>	 PROBLEM - BFD status on cr1-drmrs is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:59:16] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:59:28] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:01:08] <logmsgbot>	 !log volans@cumin1001 dbctl commit (dc=all): 'Doesn't have new wikiuser', diff saved to https://phabricator.wikimedia.org/P29880 and previous config saved to /var/cache/conftool/dbconfig/20220616-140107-volans.json
[14:01:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:19] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] aptrepo: add repository component for bird2 [puppet] - 10https://gerrit.wikimedia.org/r/805448 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh)
[14:02:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:02:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:30] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:03:58] <icinga-wm>	 RECOVERY - BFD status on cr1-drmrs is OK: OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:04:00] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:04:10] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:04:24] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] MW DB user: update username to wikiuser202206 [puppet] - 10https://gerrit.wikimedia.org/r/806221 (owner: 10Volans)
[14:04:38] <wikibugs>	 (03CR) 10Volans: [C: 03+2] MW DB user: update username to wikiuser202206 [puppet] - 10https://gerrit.wikimedia.org/r/806221 (owner: 10Volans)
[14:04:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P29881 and previous config saved to /var/cache/conftool/dbconfig/20220616-140453-root.json
[14:04:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T302659)', diff saved to https://phabricator.wikimedia.org/P29882 and previous config saved to /var/cache/conftool/dbconfig/20220616-140613-marostegui.json
[14:06:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:18] <stashbot>	 T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659
[14:06:57] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] Add a host's confctl pooled status and weight per service to prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis)
[14:07:03] <wikibugs>	 (03PS24) 10Btullis: Add a host's confctl pooled status and weight per service to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246)
[14:07:30] <wikibugs>	 (03Merged) 10jenkins-bot: log: stop suppressing logging exceptions [software/spicerack] - 10https://gerrit.wikimedia.org/r/806225 (owner: 10Jbond)
[14:09:10] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:09:17] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh)
[14:21:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P29883 and previous config saved to /var/cache/conftool/dbconfig/20220616-142118-marostegui.json
[14:21:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:36] <icinga-wm>	 PROBLEM - puppet last run on webperf2002 is CRITICAL: CRITICAL: Puppet has been disabled for 604927 seconds, message: migration to webperf2004, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:26:00] <wikibugs>	 (03PS2) 10Muehlenhoff: Switch old Stretch arclamp nodes to role::insetup until eventual decom [puppet] - 10https://gerrit.wikimedia.org/r/804341 (https://phabricator.wikimedia.org/T305460)
[14:27:04] <icinga-wm>	 PROBLEM - puppet last run on webperf1002 is CRITICAL: CRITICAL: Puppet has been disabled for 605108 seconds, message: migration to webperf1004, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:29:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P29884 and previous config saved to /var/cache/conftool/dbconfig/20220616-142923-ladsgroup.json
[14:29:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:26] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1089.eqiad.wmnet,service=ats-be
[14:29:26] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1089.eqiad.wmnet,service=varnish-fe
[14:29:26] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1089.eqiad.wmnet,service=ats-tls
[14:29:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P29885 and previous config saved to /var/cache/conftool/dbconfig/20220616-143623-marostegui.json
[14:36:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:14] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[14:43:28] <wikibugs>	 10SRE-OnFire, 10Wikidata, 10wdwb-tech, 10wmde-team-a-tech, and 4 others: Only generate maxlag from pooled query service servers. - https://phabricator.wikimedia.org/T238751 (10ItamarWMDE)
[14:44:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 50%: Maint done', diff saved to https://phabricator.wikimedia.org/P29886 and previous config saved to /var/cache/conftool/dbconfig/20220616-144427-ladsgroup.json
[14:44:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:52] <wikibugs>	 10SRE, 10Data-Engineering, 10Traffic, 10Patch-For-Review, 10User-zeljkofilipin: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) I've run out of time to work on this for now, so I'm removing the #data-engineering-kanba...
[14:45:30] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[14:51:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T302659)', diff saved to https://phabricator.wikimedia.org/P29887 and previous config saved to /var/cache/conftool/dbconfig/20220616-145128-marostegui.json
[14:51:30] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[14:51:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:31] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[14:51:32] <stashbot>	 T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659
[14:51:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T302659)', diff saved to https://phabricator.wikimedia.org/P29888 and previous config saved to /var/cache/conftool/dbconfig/20220616-145136-marostegui.json
[14:51:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:45] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+1] Add profile::mediawiki::sharded_periodic_job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/804800 (owner: 10Legoktm)
[14:58:57] <wikibugs>	 (03PS4) 10Majavah: Separate metricsinfra nodes from prometheus_nodes on cloud [puppet] - 10https://gerrit.wikimedia.org/r/795143
[14:59:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P29889 and previous config saved to /var/cache/conftool/dbconfig/20220616-145931-ladsgroup.json
[14:59:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:07] <wikibugs>	 (03PS2) 10Ayounsi: wmf-netbox: simplify interface description for circuits [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/805898 (https://phabricator.wikimedia.org/T310591)
[15:00:53] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] Reenable U2F for now [puppet] - 10https://gerrit.wikimedia.org/r/805836 (owner: 10Muehlenhoff)
[15:00:56] <wikibugs>	 (03CR) 10Muehlenhoff: cas: Update to 6.5.5 (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/806203 (https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff)
[15:01:29] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch old Stretch arclamp nodes to role::insetup until eventual decom [puppet] - 10https://gerrit.wikimedia.org/r/804341 (https://phabricator.wikimedia.org/T305460) (owner: 10Muehlenhoff)
[15:02:04] <wikibugs>	 (03CR) 10Ayounsi: "Example diff ran locally:" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/805898 (https://phabricator.wikimedia.org/T310591) (owner: 10Ayounsi)
[15:03:53] <wikibugs>	 (03CR) 10Itamar Givon: [cirrus] Fix typo in config var (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801792 (owner: 10DCausse)
[15:05:48] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:05:54] <icinga-wm>	 RECOVERY - Check systemd state on webperf2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:08:22] <icinga-wm>	 RECOVERY - Check systemd state on webperf1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:08:26] <icinga-wm>	 RECOVERY - puppet last run on webperf2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[15:08:43] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/795143 (owner: 10Majavah)
[15:10:58] <icinga-wm>	 RECOVERY - puppet last run on webperf1002 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[15:11:08] <icinga-wm>	 RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:13:40] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[15:14:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1128 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P29890 and previous config saved to /var/cache/conftool/dbconfig/20220616-151434-ladsgroup.json
[15:14:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:10] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (10MoritzMuehlenhoff) 05Open→03Resolved This is complete.
[15:16:15] <wikibugs>	 (03PS1) 10Btullis: Update the container image used by DataHub 0.8.38 [deployment-charts] - 10https://gerrit.wikimedia.org/r/806232 (https://phabricator.wikimedia.org/T310079)
[15:22:29] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Update the container image used by DataHub 0.8.38 [deployment-charts] - 10https://gerrit.wikimedia.org/r/806232 (https://phabricator.wikimedia.org/T310079) (owner: 10Btullis)
[15:23:08] <wikibugs>	 10SRE, 10ops-codfw: Degraded RAID on aqs2005 - https://phabricator.wikimedia.org/T310610 (10Papaul) 05Open→03Resolved a:03Papaul Icinga is showing green on the RAID check  `    MD RAID    This service is currently in a period of scheduled downtime View Extra Service Notes  OK  2022-06-16 15:16:31  2d 2h...
[15:25:52] <wikibugs>	 (03Merged) 10jenkins-bot: Update the container image used by DataHub 0.8.38 [deployment-charts] - 10https://gerrit.wikimedia.org/r/806232 (https://phabricator.wikimedia.org/T310079) (owner: 10Btullis)
[15:26:18] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Update the trafficserver rule for datahub [puppet] - 10https://gerrit.wikimedia.org/r/805331 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis)
[15:27:21] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[15:27:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:38] <wikibugs>	 (03PS4) 10Clare Ming: Turn off TOC A/B test for pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805179 (https://phabricator.wikimedia.org/T309683)
[15:28:39] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[15:28:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:04] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main
[15:29:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:30] <icinga-wm>	 PROBLEM - AQS root url on aqs2012 is CRITICAL: connect to address 10.192.48.189 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[15:29:32] <icinga-wm>	 PROBLEM - Check systemd state on aqs2005 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:29:36] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100%
[15:30:02] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main
[15:30:04] <icinga-wm>	 PROBLEM - AQS root url on aqs2009 is CRITICAL: connect to address 10.192.48.186 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[15:30:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:18] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[15:30:22] <icinga-wm>	 PROBLEM - Check systemd state on aqs2004 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:30:30] <icinga-wm>	 PROBLEM - AQS root url on aqs2006 is CRITICAL: connect to address 10.192.16.168 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[15:30:34] <icinga-wm>	 PROBLEM - Check systemd state on aqs2010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:30:34] <icinga-wm>	 PROBLEM - AQS root url on aqs2005 is CRITICAL: connect to address 10.192.16.42 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[15:30:35] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main
[15:30:36] <icinga-wm>	 PROBLEM - Check systemd state on aqs2012 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:30:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:30] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main
[15:31:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:55] <btullis>	 ---^ I believe these AQS alerts should have been in downtime, as that cluster in codfw is still being set up. I will check.
[15:33:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T302659)', diff saved to https://phabricator.wikimedia.org/P29891 and previous config saved to /var/cache/conftool/dbconfig/20220616-153320-marostegui.json
[15:33:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:25] <stashbot>	 T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659
[15:35:04] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 269.39 ms
[15:35:46] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 0%, RTA = 668.78 ms
[15:38:24] <icinga-wm>	 PROBLEM - AQS root url on aqs2011 is CRITICAL: connect to address 10.192.48.188 and port 7232: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS%23Monitoring
[15:38:26] <wikibugs>	 (03Abandoned) 10Samtar: Raise $wgAutoblockExpiry from 1 day to 3 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767912 (https://phabricator.wikimedia.org/T43479) (owner: 10Samtar)
[15:42:04] <wikibugs>	 (03PS1) 10Krinkle: Only try to create User object if username is not null [extensions/CheckUser] (wmf/1.39.0-wmf.16) - 10https://gerrit.wikimedia.org/r/806246 (https://phabricator.wikimedia.org/T310747)
[15:47:53] <wikibugs>	 10SRE, 10ops-codfw: Degraded RAID on ms-be2066 - https://phabricator.wikimedia.org/T309595 (10Papaul) 05Open→03Resolved I checked with @MatthewVernon on IRC; he said: `  yeah, that was a consequence of changing the SSDs in that box from RAID-0 to non-RAID, it's OK to close that task ` so we are good to clos...
[15:48:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P29892 and previous config saved to /var/cache/conftool/dbconfig/20220616-154825-marostegui.json
[15:48:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:50:25] <wikibugs>	 10SRE, 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (10ayounsi) a:05ayounsi→03RobH Ideally I'd like DCops to take care of link/interface level problems. I'm happy to help if needed though.
[15:51:18] <icinga-wm>	 PROBLEM - Host lvs2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:56:10] <icinga-wm>	 PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:57:18] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:58:45] <wikibugs>	 (03PS2) 10Cwhite: logstash: add test2 partition to ecs-test policy [puppet] - 10https://gerrit.wikimedia.org/r/805921 (https://phabricator.wikimedia.org/T310760)
[16:00:05] <jouncebot>	 jbond and rzl: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220616T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:06] <papaul>	 lvs2009 was me, sorry about that
[16:00:54] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] "Thanks for catching that!" [puppet] - 10https://gerrit.wikimedia.org/r/805921 (https://phabricator.wikimedia.org/T310760) (owner: 10Cwhite)
[16:02:50] <wikibugs>	 10SRE, 10Analytics, 10Data-Engineering: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (10odimitrijevic)
[16:03:04] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:03:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P29893 and previous config saved to /var/cache/conftool/dbconfig/20220616-160330-marostegui.json
[16:03:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:08] <icinga-wm>	 RECOVERY - Host lvs2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 46.55 ms
[16:07:11] <wikibugs>	 10SRE, 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (10RobH)
[16:07:31] <wikibugs>	 10SRE, 10Data-Engineering, 10Event-Platform, 10serviceops: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10Ottomata) a:05Jelto→03None
[16:08:57] <wikibugs>	 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Event-Platform, and 2 others: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (10JArguello-WMF)
[16:13:45] <wikibugs>	 (03CR) 10Jbond: Add a host's confctl pooled status and weight per service to prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/776225 (https://phabricator.wikimedia.org/T300246) (owner: 10Btullis)
[16:14:00] <wikibugs>	 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Event-Platform, 10serviceops: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (10Ottomata)
[16:18:24] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh)
[16:18:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T302659)', diff saved to https://phabricator.wikimedia.org/P29894 and previous config saved to /var/cache/conftool/dbconfig/20220616-161835-marostegui.json
[16:18:38] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[16:18:40] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[16:18:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:42] <stashbot>	 T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659
[16:18:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T302659)', diff saved to https://phabricator.wikimedia.org/P29895 and previous config saved to /var/cache/conftool/dbconfig/20220616-161844-marostegui.json
[16:18:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:10] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM, might be worth a pcc just to make sure" [puppet] - 10https://gerrit.wikimedia.org/r/805836 (owner: 10Muehlenhoff)
[16:19:47] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] cas: Update to 6.5.5 (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/806203 (https://phabricator.wikimedia.org/T305518) (owner: 10Muehlenhoff)
[16:20:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[16:24:00] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Netbox: expose Netbox on the frontend's FQDN [puppet] - 10https://gerrit.wikimedia.org/r/805888 (https://phabricator.wikimedia.org/T243928) (owner: 10Ayounsi)
[16:28:57] <wikibugs>	 (03PS1) 10Majavah: perl: add libfile-slurp-perl package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/806242 (https://phabricator.wikimedia.org/T305308)
[16:29:18] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] perl: add libfile-slurp-perl package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/806242 (https://phabricator.wikimedia.org/T305308) (owner: 10Majavah)
[16:30:35] <wikibugs>	 (03Merged) 10jenkins-bot: perl: add libfile-slurp-perl package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/806242 (https://phabricator.wikimedia.org/T305308) (owner: 10Majavah)
[16:36:16] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[16:37:20] <wikibugs>	 (03CR) 10Ayounsi: Prometheus: gently pull Netbox django metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/805889 (https://phabricator.wikimedia.org/T243928) (owner: 10Ayounsi)
[16:38:37] <wikibugs>	 (03CR) 10Dave Pifke: [C: 03+1] "PCC failure was:" [puppet] - 10https://gerrit.wikimedia.org/r/804546 (owner: 10Muehlenhoff)
[16:43:23] <wikibugs>	 10SRE, 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (10RobH) Worked on the email draft with Arzhel and just emailed it in CC'd both Arzhel and Cathal.  Once I have more info I'll update this ticket.
[16:47:34] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Q3 2018/19 Goal: TEC6: Build automated workflows for server provisioning  (Tracking Task) - https://phabricator.wikimedia.org/T213114 (10ayounsi)
[16:47:40] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netbox: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) 05Open→03Resolved a:03ayounsi Finally time to close this task.  We've added more things to Netbox since, but no need for a tracking task anymore.  Tracking core sites pow...
[16:52:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T302659)', diff saved to https://phabricator.wikimedia.org/P29896 and previous config saved to /var/cache/conftool/dbconfig/20220616-165210-marostegui.json
[16:52:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:52:16] <stashbot>	 T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659
[16:53:06] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100%
[16:53:48] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[16:57:32] <wikibugs>	 (Abandoned) Marostegui: mariadb: Change wikiuser user [software] - https://gerrit.wikimedia.org/r/806222 (owner: Marostegui)
[16:58:50] <volans>	 mr1-eqsin should be the scheduled maintenance of eqsin's provider for one power feed
[17:00:05] <jouncebot>	 brennen and thcipriani: Time to snap out of that daydream and deploy Phabricator update. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220616T1700).
[17:00:57] <brennen>	 o/ - but this may or may not be happening.  continuing to hold the time in case we figure out what we're doing.
[17:02:03] <wikibugs>	 (PS1) Majavah: Provide a nodejs16 image based on Bullseye and Nodesource [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/806266 (https://phabricator.wikimedia.org/T310821)
[17:05:42] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[17:07:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P29897 and previous config saved to /var/cache/conftool/dbconfig/20220616-170715-marostegui.json
[17:07:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P29898 and previous config saved to /var/cache/conftool/dbconfig/20220616-172220-marostegui.json
[17:22:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:20] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:26:42] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx1001.wikimedia.org with reason: New Kernel
[17:26:44] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx1001.wikimedia.org with reason: New Kernel
[17:26:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:12] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on phabricator.wikimedia.org with reason: bug fix
[17:27:13] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phabricator.wikimedia.org with reason: bug fix
[17:27:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:31] <wikibugs>	 (PS1) Dzahn: langlist: add blk, Pa'O language [dns] - https://gerrit.wikimedia.org/r/806267 (https://phabricator.wikimedia.org/T310777)
[17:31:24] <wikibugs>	 (PS1) Dzahn: langlist: add pcm, Nigerian Pidgin language [dns] - https://gerrit.wikimedia.org/r/806268 (https://phabricator.wikimedia.org/T310776)
[17:31:25] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on phab1001.eqiad.wmnet with reason: bug fix
[17:31:27] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab1001.eqiad.wmnet with reason: bug fix
[17:31:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T302659)', diff saved to https://phabricator.wikimedia.org/P29899 and previous config saved to /var/cache/conftool/dbconfig/20220616-173725-marostegui.json
[17:37:27] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[17:37:29] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[17:37:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:30] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[17:37:31] <stashbot>	 T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659
[17:37:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:34] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[17:37:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T302659)', diff saved to https://phabricator.wikimedia.org/P29900 and previous config saved to /var/cache/conftool/dbconfig/20220616-173738-marostegui.json
[17:37:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:12] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "lol @ "gently pull". LGTM!" [puppet] - https://gerrit.wikimedia.org/r/805889 (https://phabricator.wikimedia.org/T243928) (owner: Ayounsi)
[17:41:00] <wikibugs>	 SRE, ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (RobH)
[17:41:35] <wikibugs>	 SRE, ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (RobH) Arelion support case 01418061 to investigate things.  I'll followup with them as they progress the case.
[17:42:18] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on phab.wmfusercontent.org with reason: bug fix
[17:42:20] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab.wmfusercontent.org with reason: bug fix
[17:42:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:42:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:43:10] <wikibugs>	 (CR) Jdlrobson: [C: +1] Fix unsupported $wgLogos default configurations [mediawiki-config] - https://gerrit.wikimedia.org/r/806068 (https://phabricator.wikimedia.org/T310767) (owner: Thiemo Kreuz (WMDE))
[17:43:19] <wikibugs>	 (CR) Ayounsi: [C: +2] Prometheus: gently pull Netbox django metrics [puppet] - https://gerrit.wikimedia.org/r/805889 (https://phabricator.wikimedia.org/T243928) (owner: Ayounsi)
[17:46:13] <brennen>	 !log starting phabricator deploy, momentary downtime expected while services restart
[17:46:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:46:21] <TheresNoTime>	 \o/
[17:51:13] <wikibugs>	 (CR) Cwhite: [C: +2] opensearch: ensure elasticsearch-curator on opensearch compatible fork [puppet] - https://gerrit.wikimedia.org/r/803587 (https://phabricator.wikimedia.org/T301017) (owner: Cwhite)
[17:52:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:54:07] <wikibugs>	 (CR) Dzahn: [C: +2] gitlab_runner: Allow subdirs in image paths [puppet] - https://gerrit.wikimedia.org/r/805247 (https://phabricator.wikimedia.org/T310535) (owner: Brennen Bearnes)
[17:58:44] <icinga-wm>	 RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:59:14] <brennen>	 !log end of phabricator deploy
[17:59:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:05] <jouncebot>	 brennen and jeena: Dear deployers, time to do the MediaWiki train - Utc-7 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220616T1800).
[18:01:06] <brennen>	 wheeeee
[18:01:39] <brennen>	 i'm gonna take 5 here, deploy a backport, then go ahead to all wikis.
[18:02:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:05:36] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:06:40] <mutante>	 what a schedule and timing there, brennen. kudos
[18:06:52] <mutante>	 :59
[18:06:56] <brennen>	 :D
[18:10:00] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: sync on main
[18:10:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T302659)', diff saved to https://phabricator.wikimedia.org/P29901 and previous config saved to /var/cache/conftool/dbconfig/20220616-181005-marostegui.json
[18:10:07] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main
[18:10:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:10] <stashbot>	 T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659
[18:10:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:16] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: sync on main
[18:10:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:20] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main
[18:10:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:11:18] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 217.14 ms
[18:11:27] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: sync on main
[18:11:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:12:06] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 233.93 ms
[18:12:12] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main
[18:12:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:12:36] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: sync on main
[18:12:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:13:19] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main
[18:13:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:15:00] <wikibugs>	 (CR) Brennen Bearnes: [C: +2] Only try to create User object if username is not null [extensions/CheckUser] (wmf/1.39.0-wmf.16) - https://gerrit.wikimedia.org/r/806246 (https://phabricator.wikimedia.org/T310747) (owner: Krinkle)
[18:17:36] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:20:25] <wikibugs>	 (PS1) AOkoth: install_server: remove gitlab-runner1001 [puppet] - https://gerrit.wikimedia.org/r/806273 (https://phabricator.wikimedia.org/T307142)
[18:21:39] <wikibugs>	 (CR) Dzahn: [C: +2] install_server: remove gitlab-runner1001 [puppet] - https://gerrit.wikimedia.org/r/806273 (https://phabricator.wikimedia.org/T307142) (owner: AOkoth)
[18:25:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P29902 and previous config saved to /var/cache/conftool/dbconfig/20220616-182510-marostegui.json
[18:25:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:22] <wikibugs>	 (CR) Zabe: [C: +1] langlist: add blk, Pa'O language [dns] - https://gerrit.wikimedia.org/r/806267 (https://phabricator.wikimedia.org/T310777) (owner: Dzahn)
[18:25:42] <wikibugs>	 SRE, Data-Engineering, Data-Engineering-Kanban, Event-Platform, serviceops: eventstreams chart should use latest common_templates - https://phabricator.wikimedia.org/T310721 (SLyngshede-WMF) p:Triage→Medium
[18:25:50] <wikibugs>	 (CR) Zabe: [C: +1] langlist: add pcm, Nigerian Pidgin language [dns] - https://gerrit.wikimedia.org/r/806268 (https://phabricator.wikimedia.org/T310776) (owner: Dzahn)
[18:26:46] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100%
[18:27:36] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[18:29:30] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.decommission for hosts gitlab-runner1001.eqiad.wmnet
[18:29:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:07] <wikibugs>	 (Merged) jenkins-bot: Only try to create User object if username is not null [extensions/CheckUser] (wmf/1.39.0-wmf.16) - https://gerrit.wikimedia.org/r/806246 (https://phabricator.wikimedia.org/T310747) (owner: Krinkle)
[18:34:00] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 232.88 ms
[18:38:02] <wikibugs>	 (PS1) AOkoth: install_server: remove gitlab-runner 2001 [puppet] - https://gerrit.wikimedia.org/r/806276 (https://phabricator.wikimedia.org/T307142)
[18:38:12] <wikibugs>	 SRE, Data-Catalog, Data-Engineering, serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (JMeybohm) Open→Resolved All merged. Thanks! 🎉
[18:38:37] <wikibugs>	 (CR) Dzahn: [C: +1] install_server: remove gitlab-runner 2001 [puppet] - https://gerrit.wikimedia.org/r/806276 (https://phabricator.wikimedia.org/T307142) (owner: AOkoth)
[18:39:00] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:39:44] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 217.20 ms
[18:40:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P29903 and previous config saved to /var/cache/conftool/dbconfig/20220616-184015-marostegui.json
[18:40:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:33] <TheresNoTime>	 oh thcipriani while I remember, ref T305191, the next training session it might be good to be walked through a deploy - can I get access to do that?
[18:40:34] <stashbot>	 T305191: Deployment training request for TheresNoTime - https://phabricator.wikimedia.org/T305191
[18:40:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:40:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:14] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.292 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:42:06] <logmsgbot>	 !log brennen@deploy1002 Synchronized php-1.39.0-wmf.16/extensions/CheckUser/src/Hooks.php: Backport: [[gerrit:806246|Only try to create User object if username is not null (T310747)]] (duration: 03m 23s)
[18:42:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:12] <stashbot>	 T310747: TypeError: Argument 1 passed to MediaWiki\User\UserFactory::newFromName() must be of the type string, null given - https://phabricator.wikimedia.org/T310747
[18:44:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:44:27] <brennen>	 !log train 1.39.0-wmf.16 (T308069): no current blockers - rolling to all wikis
[18:44:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:44:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:33] <stashbot>	 T308069: 1.39.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T308069
[18:44:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:07] <wikibugs>	 (PS1) Brennen Bearnes: all wikis to 1.39.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/806278 (https://phabricator.wikimedia.org/T308069)
[18:45:09] <wikibugs>	 (CR) Brennen Bearnes: [C: +2] all wikis to 1.39.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/806278 (https://phabricator.wikimedia.org/T308069) (owner: Brennen Bearnes)
[18:45:11] <wikibugs>	 (CR) AOkoth: [C: +2] install_server: remove gitlab-runner 2001 [puppet] - https://gerrit.wikimedia.org/r/806276 (https://phabricator.wikimedia.org/T307142) (owner: AOkoth)
[18:45:53] <wikibugs>	 (Merged) jenkins-bot: all wikis to 1.39.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/806278 (https://phabricator.wikimedia.org/T308069) (owner: Brennen Bearnes)
[18:48:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:48:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:49:57] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.16  refs T308069
[18:50:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:50:02] <stashbot>	 T308069: 1.39.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T308069
[18:50:16] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.dns.netbox
[18:50:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:53:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:18] <logmsgbot>	 !log dzahn@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[18:53:19] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts gitlab-runner1001.eqiad.wmnet
[18:53:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:54:10] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.decommission for hosts gitlab-runner1001.eqiad.wmnet
[18:54:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:54:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:54:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:54:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:54:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:55:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T302659)', diff saved to https://phabricator.wikimedia.org/P29904 and previous config saved to /var/cache/conftool/dbconfig/20220616-185520-marostegui.json
[18:55:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:55:24] <stashbot>	 T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659
[18:57:22] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.dns.netbox
[18:57:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:57:30] <wikibugs>	 (PS1) AOkoth: site: remove old gitlab runners [puppet] - https://gerrit.wikimedia.org/r/806279 (https://phabricator.wikimedia.org/T307142)
[18:57:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:57:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:58:13] <wikibugs>	 (CR) Dzahn: [C: +1] "lgtm but only merge after cookbook is done" [puppet] - https://gerrit.wikimedia.org/r/806279 (https://phabricator.wikimedia.org/T307142) (owner: AOkoth)
[19:00:54] <logmsgbot>	 !log dzahn@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[19:00:55] <logmsgbot>	 !log dzahn@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts gitlab-runner1001.eqiad.wmnet
[19:00:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:03:53] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.decommission for hosts gitlab-runner2001.codfw.wmnet
[19:03:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:28] <wikibugs>	 SRE: DNS cookbook failed syncing with netbox - 403 from netbox1002 - https://phabricator.wikimedia.org/T310831 (Dzahn)
[19:11:16] <wikibugs>	 SRE: DNS cookbook failed syncing with netbox - 403 from netbox1002 - https://phabricator.wikimedia.org/T310831 (Dzahn)
[19:11:46] <wikibugs>	 SRE, Infrastructure-Foundations, netbox: DNS cookbook failed syncing with netbox - 403 from netbox1002 - https://phabricator.wikimedia.org/T310831 (Dzahn)
[19:19:56] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:20:30] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:23:11] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.dns.netbox
[19:23:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:23:24] <wikibugs>	 SRE, Infrastructure-Foundations, netbox: DNS cookbook failed syncing with netbox - 403 from netbox1002 - https://phabricator.wikimedia.org/T310831 (Dzahn) After this I ran only the DNS cookbook directly and this time it finished without such an error. I am not sure if it tried though because it said...
[19:24:14] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.544 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:24:50] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48250 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:26:03] <wikibugs>	 (PS1) JMeybohm: Allow to dry-run SREBatchRunnerBase [cookbooks] - https://gerrit.wikimedia.org/r/806285
[19:26:05] <wikibugs>	 (PS1) JMeybohm: SREBatchBase: Fix broken batchsize argument [cookbooks] - https://gerrit.wikimedia.org/r/806286
[19:26:07] <wikibugs>	 (PS1) JMeybohm: sre.k8s.reboot-nodes: Fix errors identified during dry-run [cookbooks] - https://gerrit.wikimedia.org/r/806287 (https://phabricator.wikimedia.org/T260661)
[19:26:09] <wikibugs>	 (PS1) JMeybohm: sre.k8s.reboot-node: Dynamically adjust batchsize [cookbooks] - https://gerrit.wikimedia.org/r/806288 (https://phabricator.wikimedia.org/T260661)
[19:30:54] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:36:29] <wikibugs>	 (CR) RhinosF1: [C: +1] langlist: add pcm, Nigerian Pidgin language [dns] - https://gerrit.wikimedia.org/r/806268 (https://phabricator.wikimedia.org/T310776) (owner: Dzahn)
[19:36:50] <wikibugs>	 (CR) RhinosF1: [C: +1] langlist: add blk, Pa'O language [dns] - https://gerrit.wikimedia.org/r/806267 (https://phabricator.wikimedia.org/T310777) (owner: Dzahn)
[19:38:39] <wikibugs>	 SRE, Infrastructure-Foundations, netbox: DNS cookbook failed syncing with netbox - 403 from netbox1002 - https://phabricator.wikimedia.org/T310831 (Dzahn) The run of the decom book was at:  `2022-06-16 18:54:09,812 dzahn 2165070 [INFO] START - Cookbook sre.hosts.decommission for hosts gitlab-runner10...
[19:39:38] <logmsgbot>	 !log aokoth@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[19:39:39] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts gitlab-runner2001.codfw.wmnet
[19:39:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:43:56] <wikibugs>	 (CR) DannyS712: phpcs: move SpaceBeforeSingleLineComment.NewLineComment exclusions (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/805432 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712)
[19:49:07] <RhinosF1>	 mutante: +1'd both I saw in my email
[19:52:16] <mutante>	 RhinosF1: thank you, ACK
[19:53:48] <wikibugs>	 SRE, Infrastructure-Foundations, netbox: DNS cookbook failed syncing with netbox - 403 from netbox1002 - https://phabricator.wikimedia.org/T310831 (Arnoldokoth) fatal: unable to access 'https://netbox1002.eqiad.wmnet/dns.git/': The requested URL returned error: 403 0.0% (0/1) success ratio (< 100.0%...
[19:56:26] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to contint-admins for taavi - https://phabricator.wikimedia.org/T309375 (thcipriani) >>! In T309375#7963808, @Dzahn wrote: > checked off boxes (L3 signed, NDA, has existing shell access, etc). >  >  > Will need approval from group approver (Tyler).  @hashar and...
[19:58:39] <wikibugs>	 (PS4) DannyS712: CommonSettings: clean up and simplify some code [mediawiki-config] - https://gerrit.wikimedia.org/r/805433
[19:58:43] <wikibugs>	 (CR) DannyS712: CommonSettings: clean up and simplify some code (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/805433 (owner: DannyS712)
[20:00:05] <jouncebot>	 brennen: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220616T2000).
[20:00:05] <jouncebot>	 cjming: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:01:00] <wikibugs>	 (PS5) DannyS712: CommonSettings: clean up and simplify some code [mediawiki-config] - https://gerrit.wikimedia.org/r/805433
[20:01:36] <DannyS712>	 brennen also me, added my patch a few seconds too late
[20:01:53] <DannyS712>	 (i.e. I also have patches scheduled for the deployment window)
[20:01:59] <TheresNoTime>	 tsk /j
[20:03:35] <cjming>	 o/
[20:03:37] <cjming>	 i'll deploy
[20:04:41] <wikibugs>	 (PS3) DannyS712: phpcs: move SpaceBeforeSingleLineComment.NewLineComment exclusions [mediawiki-config] - https://gerrit.wikimedia.org/r/805432 (https://phabricator.wikimedia.org/T171115)
[20:05:03] <wikibugs>	 (CR) Clare Ming: [C: +2] Turn off TOC A/B test for pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/805179 (https://phabricator.wikimedia.org/T309683) (owner: Clare Ming)
[20:06:26] <wikibugs>	 (Merged) jenkins-bot: Turn off TOC A/B test for pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/805179 (https://phabricator.wikimedia.org/T309683) (owner: Clare Ming)
[20:10:38] <wikibugs>	 (PS4) Clare Ming: phpcs: move SpaceBeforeSingleLineComment.NewLineComment exclusions [mediawiki-config] - https://gerrit.wikimedia.org/r/805432 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712)
[20:11:16] <cjming>	 hi DannyS712: doing your patches here next
[20:12:14] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:805179|Turn off TOC A/B test for pilot wikis (T309683)]] (duration: 03m 37s)
[20:12:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:12:18] <stashbot>	 T309683: Turn off table of contents A/B test - https://phabricator.wikimedia.org/T309683
[20:13:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:13:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:19] <wikibugs>	 (CR) Clare Ming: [C: +2] phpcs: move SpaceBeforeSingleLineComment.NewLineComment exclusions [mediawiki-config] - https://gerrit.wikimedia.org/r/805432 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712)
[20:14:49] <wikibugs>	 (CR) DannyS712: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/806248 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712)
[20:14:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:14:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:14:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:59] <DannyS712>	 ^ another patch I'm going to add for the current window
[20:14:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:14] <wikibugs>	 (Merged) jenkins-bot: phpcs: move SpaceBeforeSingleLineComment.NewLineComment exclusions [mediawiki-config] - https://gerrit.wikimedia.org/r/805432 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712)
[20:16:00] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:18:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:18:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:18:46] <cjming>	 DannyS712: going to sync your 1st patch since it's comments
[20:19:23] <DannyS712>	 okay, then I have two more that aren't comments
[20:19:39] <DannyS712>	 sorry, 3 more
[20:20:39] <wikibugs>	 (CR) DannyS712: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/806249 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712)
[20:20:42] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[20:23:30] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/: Config: [[gerrit:805432|phpcs: move SpaceBeforeSingleLineComment.NewLineComment exclusions (T171115)]] (duration: 03m 22s)
[20:23:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:35] <stashbot>	 T171115: Remove phpcs exceptions and severity 0 from mediawiki-config - https://phabricator.wikimedia.org/T171115
[20:23:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:23:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:26] <wikibugs>	 (PS6) Clare Ming: CommonSettings: clean up and simplify some code [mediawiki-config] - https://gerrit.wikimedia.org/r/805433 (owner: DannyS712)
[20:24:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:24:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:24:41] <thcipriani>	 DannyS712: we might run out of time in this window, but we're looking. Scap is a little bit slower since we're doing PHP restarts for every deploy.
[20:24:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:25:01] <DannyS712>	 okay, no rush
[20:25:05] <thcipriani>	 <3
[20:25:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:25:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:23] <logmsgbot>	 !log cjming@deploy1002 Synchronized phpcs.xml: Config: [[gerrit:805432|phpcs: move SpaceBeforeSingleLineComment.NewLineComment exclusions (T171115)]] (duration: 03m 27s)
[20:27:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:45] <wikibugs>	 (PS1) Dzahn: Revert "Revert "Provide buildkitd to GitLab runners"" [puppet] - https://gerrit.wikimedia.org/r/806250
[20:31:27] <wikibugs>	 (PS2) Dzahn: Revert "Revert "Provide buildkitd to GitLab runners"" [puppet] - https://gerrit.wikimedia.org/r/806250
[20:32:04] <wikibugs>	 (CR) Thcipriani: [C: +2] phpcs: enable PrefixedGlobalFunctions.allowedPrefix and rename functions [mediawiki-config] - https://gerrit.wikimedia.org/r/806248 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712)
[20:33:40] <wikibugs>	 (Merged) jenkins-bot: phpcs: enable PrefixedGlobalFunctions.allowedPrefix and rename functions [mediawiki-config] - https://gerrit.wikimedia.org/r/806248 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712)
[20:34:43] <wikibugs>	 (CR) Dzahn: [C: +2] Revert "Revert "Provide buildkitd to GitLab runners"" [puppet] - https://gerrit.wikimedia.org/r/806250 (owner: Dzahn)
[20:34:49] <thcipriani>	 DannyS712: I fetched your function rename one down to mwdebug1002, if you want to take a look
[20:35:13] <DannyS712>	 not really sure where I can test that
[20:36:16] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[20:36:17] <DannyS712>	 so what should I do?
[20:37:30] <thcipriani>	 How do you mean? You're not sure what to test there? Or not sure what part of the front end exercises that since it happens before hitting mwcore?
[20:37:57] <DannyS712>	 both - not sure what to test or where to test it
[20:38:05] <wikibugs>	 (CR) JHathaway: [C: +2] exim: update comment on BDAT issue [puppet] - https://gerrit.wikimedia.org/r/803601 (https://phabricator.wikimedia.org/T307873) (owner: JHathaway)
[20:38:12] <wikibugs>	 (PS3) JHathaway: exim: update comment on BDAT issue [puppet] - https://gerrit.wikimedia.org/r/803601 (https://phabricator.wikimedia.org/T307873)
[20:38:28] <wikibugs>	 (CR) JHathaway: [V: +2] exim: update comment on BDAT issue [puppet] - https://gerrit.wikimedia.org/r/803601 (https://phabricator.wikimedia.org/T307873) (owner: JHathaway)
[20:39:58] <thcipriani>	 DannyS712: ok, I'll make sure nothing explodes on the backend and sync
[20:40:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:40:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:41:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:41:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:01] <logmsgbot>	 !log thcipriani@deploy1002 Started scap: Config: [[gerrit:806248|phpcs: enable PrefixedGlobalFunctions.allowedPrefix and rename functions (T171115)]]
[20:42:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:04] <stashbot>	 T171115: Remove phpcs exceptions and severity 0 from mediawiki-config - https://phabricator.wikimedia.org/T171115
[20:42:08] <wikibugs>	 (Abandoned) BCornwall: Traffic: Port IPsec/Strongswan connection alert [alerts] - https://gerrit.wikimedia.org/r/805887 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[20:42:33] <wikibugs>	 (CR) Thcipriani: [C: +2] MWRealm.php: remove unused getRealmSpecificFilename() [mediawiki-config] - https://gerrit.wikimedia.org/r/806249 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712)
[20:42:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:42:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:43:17] <wikibugs>	 (Merged) jenkins-bot: MWRealm.php: remove unused getRealmSpecificFilename() [mediawiki-config] - https://gerrit.wikimedia.org/r/806249 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712)
[20:43:33] <wikibugs>	 SRE, Infrastructure-Foundations, Mail, Wikimedia-Incident: 2022-05-09 Exim BDAT Errors incident - https://phabricator.wikimedia.org/T309238 (jhathaway)
[20:44:18] <wikibugs>	 SRE, Infrastructure-Foundations, Mail, Patch-For-Review: [mitigated] Google returning 503 error when delivering to mx1001 and mx2001 - https://phabricator.wikimedia.org/T307873 (jhathaway) Open→Resolved This has now been fixed upstream, https://git.exim.org/exim.git/commit/462e2cd30. We w...
[20:45:28] <wikibugs>	 SRE, Infrastructure-Foundations, Mail: Upgrade Exim to 4.96 - https://phabricator.wikimedia.org/T310836 (jhathaway)
[20:45:46] <wikibugs>	 (CR) Cwhite: "scap.announce will be the first stream to go to Loki if releng approves." [puppet] - https://gerrit.wikimedia.org/r/804484 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite)
[20:46:04] <wikibugs>	 SRE, Infrastructure-Foundations, Mail, Wikimedia-Incident: 2022-05-09 Exim BDAT Errors incident - https://phabricator.wikimedia.org/T309238 (jhathaway)
[20:46:14] <wikibugs>	 SRE, Infrastructure-Foundations, Mail: Upgrade Exim to 4.96 - https://phabricator.wikimedia.org/T310836 (jhathaway) Open→Stalled This is stalled until 4.96 is available in Debian.
[20:47:44] <icinga-wm>	 PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:47:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:47:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:48:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:48:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:49:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:49:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:51:20] <DannyS712>	 thcipriani will https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/805433 be included in this window?
[20:51:42] <DannyS712>	 CommonSettings cleanup
[20:52:20] <wikibugs>	 (CR) Thcipriani: [C: +2] CommonSettings: clean up and simplify some code [mediawiki-config] - https://gerrit.wikimedia.org/r/805433 (owner: DannyS712)
[20:52:24] <thcipriani>	 DannyS712: sure :)
[20:53:53] <wikibugs>	 (PS1) Dzahn: gitlab::runner: set sysctl kernel.unprivileged_userns_clone = 1 [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271)
[20:54:01] <wikibugs>	 (Merged) jenkins-bot: CommonSettings: clean up and simplify some code [mediawiki-config] - https://gerrit.wikimedia.org/r/805433 (owner: DannyS712)
[20:54:58] <wikibugs>	 (CR) CI reject: [V: -1] gitlab::runner: set sysctl kernel.unprivileged_userns_clone = 1 [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn)
[20:55:15] <wikibugs>	 (PS2) Dzahn: gitlab::runner: set sysctl kernel.unprivileged_userns_clone = 1 [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271)
[20:56:56] <wikibugs>	 (PS1) BCornwall: Traffic: Port over purged lag/queue monitors [alerts] - https://gerrit.wikimedia.org/r/806332 (https://phabricator.wikimedia.org/T300723)
[20:58:35] <wikibugs>	 (PS1) Hnowlan: Port Dockerfile to use buster [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/806333
[20:58:59] <logmsgbot>	 !log thcipriani@deploy1002 Finished scap: Config: [[gerrit:806248|phpcs: enable PrefixedGlobalFunctions.allowedPrefix and rename functions (T171115)]] (duration: 16m 57s)
[20:59:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:59:03] <stashbot>	 T171115: Remove phpcs exceptions and severity 0 from mediawiki-config - https://phabricator.wikimedia.org/T171115
[20:59:10] <thcipriani>	 ok, well that's syncd
[20:59:28] <DannyS712>	 still need to sync CommonSettings though, right?
[20:59:40] <wikibugs>	 SRE, Beta-Cluster-Infrastructure, Scap, serviceops, Release-Engineering-Team (Seen): Scap can't clear opcache on mw servers in Beta Cluster - https://phabricator.wikimedia.org/T237033 (Krinkle)
[21:00:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[21:00:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:06] <wikibugs>	 (CR) BCornwall: Traffic: Port IPsec/Strongswan connection alert (1 comment) [alerts] - https://gerrit.wikimedia.org/r/805887 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[21:00:20] <thcipriani>	 DannyS712: yeah, we still have your last two to go, does that sound right to you?
[21:00:45] <thcipriani>	 they're both on mwdebug1002 if there's anything to check for either of those
[21:01:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[21:01:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[21:01:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:01:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:01:40] <DannyS712>	 nothing to check for removing an unused function, and the common settings should be a no-op, so I think should be good to sync
[21:02:13] <thcipriani>	 OK: syncing mwrealm.php then commonsettings.php
[21:02:40] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner2003 is CRITICAL: CRITICAL - degraded: The following units failed: buildkitd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:03:00] <icinga-wm>	 PROBLEM - Check systemd state on elastic2054 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_6@production-search-codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:04:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[21:04:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:05:12] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on gitlab-runner2003 is CRITICAL: CRITICAL - degraded: The following units failed: buildkitd.service daniel_zahn deployment in progress https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:05:41] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[21:06:29] <logmsgbot>	 !log thcipriani@deploy1002 Synchronized multiversion/MWRealm.php: Config: [[gerrit:806249|MWRealm.php: remove unused getRealmSpecificFilename() (T171115)]] (duration: 03m 35s)
[21:06:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:06:34] <stashbot>	 T171115: Remove phpcs exceptions and severity 0 from mediawiki-config - https://phabricator.wikimedia.org/T171115
[21:06:37] <thcipriani>	 next one going live now
[21:07:42] <wikibugs>	 (CR) CI reject: [V: -1] Port Dockerfile to use buster [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/806333 (owner: Hnowlan)
[21:10:55] <logmsgbot>	 !log thcipriani@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:805433|CommonSettings: clean up and simplify some code]] (duration: 03m 42s)
[21:10:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:11:07] <thcipriani>	 ^ DannyS712 all done!
[21:11:55] <DannyS712>	 thanks for the deployments!
[21:12:13] <thcipriani>	 sure thing, thanks for making code better :)
[21:12:35] <wikibugs>	 (PS1) Dzahn: docker::network: refresh service docker after adding a docker network [puppet] - https://gerrit.wikimedia.org/r/806341 (https://phabricator.wikimedia.org/T308271)
[21:13:37] <wikibugs>	 (CR) CI reject: [V: -1] docker::network: refresh service docker after adding a docker network [puppet] - https://gerrit.wikimedia.org/r/806341 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn)
[21:14:09] <logmsgbot>	 !log thcipriani@deploy1002 Started scap: noop test
[21:14:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:15:42] <wikibugs>	 (PS2) Dzahn: docker::network: refresh service docker after adding a docker network [puppet] - https://gerrit.wikimedia.org/r/806341 (https://phabricator.wikimedia.org/T308271)
[21:18:16] <logmsgbot>	 !log thcipriani@deploy1002 Finished scap: noop test (duration: 04m 07s)
[21:18:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:18:42] <wikibugs>	 (CR) Dzahn: "arr.. Could not find resource 'Service[docker]'" [puppet] - https://gerrit.wikimedia.org/r/806341 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn)
[21:18:50] <wikibugs>	 (CR) Dzahn: [C: -1] docker::network: refresh service docker after adding a docker network [puppet] - https://gerrit.wikimedia.org/r/806341 (https://phabricator.wikimedia.org/T308271) (owner: Dzahn)
[21:19:31] <thcipriani>	 brennen: no-op was 4m07s so there was some unsync'd localization change lurking somewhere
[21:19:49] <brennen>	 yeah, makes sense.
[21:19:55] <thcipriani>	 which is...kinda worrisome
[21:20:16] <brennen>	 possibly we oughta make scap say _why_ it's doing a cdb rebuild
[21:22:53] <thcipriani>	 huh
[21:23:30] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:32:10] <dancy>	 brennen: Scap doesn't know.  It's the mediawiki maintenance script that does the deed.
[21:33:40] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner1002 is CRITICAL: CRITICAL - degraded: The following units failed: buildkitd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:34:26] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner1004 is CRITICAL: CRITICAL - degraded: The following units failed: buildkitd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:35:18] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner1003 is CRITICAL: CRITICAL - degraded: The following units failed: buildkitd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:37:07] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on gitlab-runner1002 is CRITICAL: CRITICAL - degraded: The following units failed: buildkitd.service daniel_zahn deployment in progress - needs manual steps https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:37:07] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on gitlab-runner1003 is CRITICAL: CRITICAL - degraded: The following units failed: buildkitd.service daniel_zahn deployment in progress - needs manual steps https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:37:07] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on gitlab-runner1004 is CRITICAL: CRITICAL - degraded: The following units failed: buildkitd.service daniel_zahn deployment in progress - needs manual steps https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:39:04] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:39:58] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:40:38] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:47:04] <icinga-wm>	 RECOVERY - Check systemd state on elastic2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:48:22] <icinga-wm>	 PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:53:52] <wikibugs>	 (PS3) Dzahn: gitlab::runner: set sysctl kernel.unprivileged_userns_clone = 1 [puppet] - https://gerrit.wikimedia.org/r/806327 (https://phabricator.wikimedia.org/T308271)
[22:03:00] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:07:32] <wikibugs>	 SRE, Maps: Allow Wikimedia Maps usage on desciclopedia.org - https://phabricator.wikimedia.org/T310761 (ZnashBR) >>! In T310761#8008054, @Aklapper wrote: > @ZnashBR: Hi and welcome! Can you please elaborate on the Wikimedia Affiliate supporting project and who you have been in contact with?  Sorry, i'm n...
[22:09:41] <wikibugs>	 SRE, Infrastructure-Foundations, netbox: DNS cookbook failed syncing with netbox - 403 from netbox1002 - https://phabricator.wikimedia.org/T310831 (Volans) p:Triage→High It seems the vhost has changed:  ` root@netbox2002:~# runuser -u netbox -- git -C "/srv/netbox-exports/dns.git" fetch -v ne...
[22:19:50] <wikibugs>	 (CR) Dzahn: "might have caused https://phabricator.wikimedia.org/T310831" [puppet] - https://gerrit.wikimedia.org/r/805888 (https://phabricator.wikimedia.org/T243928) (owner: Ayounsi)
[22:20:36] <wikibugs>	 (PS1) Volans: Revert "Netbox: expose Netbox on the frontend's FQDN" [puppet] - https://gerrit.wikimedia.org/r/806251
[22:20:55] <wikibugs>	 (PS2) Volans: Revert "Netbox: expose Netbox on the frontend's FQDN" [puppet] - https://gerrit.wikimedia.org/r/806251 (https://phabricator.wikimedia.org/T310831)
[22:22:42] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:23:50] <wikibugs>	 (CR) CI reject: [V: -1] Revert "Netbox: expose Netbox on the frontend's FQDN" [puppet] - https://gerrit.wikimedia.org/r/806251 (https://phabricator.wikimedia.org/T310831) (owner: Volans)
[22:24:25] <wikibugs>	 (PS3) Volans: Revert "Netbox: expose Netbox on the frontend's FQDN" [puppet] - https://gerrit.wikimedia.org/r/806251 (https://phabricator.wikimedia.org/T310831)
[22:26:24] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to deployment for TheresNoTime - https://phabricator.wikimedia.org/T302231 (thcipriani) Stalled→Open >  - access request (or expansion) has sign off of group approver indicated by the approval field in data.yaml  Approved! @TheresNoTime is attending [[...
[22:26:50] <wikibugs>	 (CR) Ayounsi: [C: +1] Revert "Netbox: expose Netbox on the frontend's FQDN" [puppet] - https://gerrit.wikimedia.org/r/806251 (https://phabricator.wikimedia.org/T310831) (owner: Volans)
[22:27:36] <wikibugs>	 (CR) Volans: [C: +2] Revert "Netbox: expose Netbox on the frontend's FQDN" [puppet] - https://gerrit.wikimedia.org/r/806251 (https://phabricator.wikimedia.org/T310831) (owner: Volans)
[22:31:29] <wikibugs>	 SRE, Infrastructure-Foundations, netbox: DNS cookbook failed syncing with netbox - 403 from netbox1002 - https://phabricator.wikimedia.org/T310831 (Volans) Run puppet on both netbox hosts (1002/2002)
[22:32:02] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:33:17] <logmsgbot>	 !log volans@cumin2002 START - Cookbook sre.dns.netbox
[22:33:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:24] <wikibugs>	 (PS1) Cwhite: logstash: duplicate alert logs for loki target [puppet] - https://gerrit.wikimedia.org/r/806349 (https://phabricator.wikimedia.org/T222826)
[22:36:10] <wikibugs>	 (PS3) Cwhite: logstash: duplicate scap.announce logs for loki target [puppet] - https://gerrit.wikimedia.org/r/804484 (https://phabricator.wikimedia.org/T222826)
[22:37:04] <logmsgbot>	 !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:37:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:37:55] <wikibugs>	 SRE, Infrastructure-Foundations, netbox: DNS cookbook failed syncing with netbox - 403 from netbox1002 - https://phabricator.wikimedia.org/T310831 (Volans) Run dns cookbook to force sync the data everywhere (the last couple of commits were not deployed to the authdns hosts). The procedure is describ...
[22:39:02] <wikibugs>	 SRE, Infrastructure-Foundations, netbox: DNS cookbook failed syncing with netbox - 403 from netbox1002 - https://phabricator.wikimedia.org/T310831 (Volans) Open→Resolved a:Volans This should have fixed the issue. I'm resolving it, but feel free to re-open in case it's not fully solved.
[22:39:36] <wikibugs>	 (PS1) Andrew Bogott: haproxy/nova-api-metadata use the /healthcheck endpoint for health check [puppet] - https://gerrit.wikimedia.org/r/806350
[22:41:20] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:43:29] <wikibugs>	 SRE, MediaWiki-General, Traffic: Query canonicalization for MediaWiki - https://phabricator.wikimedia.org/T310087 (Krinkle) This reminds me of T140664, which is a proposal from a few years ago going in a similar direction (albeit for a different use case).  In any event, establishing such a router wi...
[22:49:34] <icinga-wm>	 RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:50:56] <wikibugs>	 SRE, Infrastructure-Foundations, netbox: DNS cookbook failed syncing with netbox - 403 from netbox1002 - https://phabricator.wikimedia.org/T310831 (Dzahn) Thank you for the very quick response!
[22:52:45] <wikibugs>	 (CR) Dzahn: [C: +2] langlist: add blk, Pa'O language [dns] - https://gerrit.wikimedia.org/r/806267 (https://phabricator.wikimedia.org/T310777) (owner: Dzahn)
[22:53:10] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[22:56:05] <wikibugs>	 (CR) Dzahn: [C: +2] langlist: add pcm, Nigerian Pidgin language [dns] - https://gerrit.wikimedia.org/r/806268 (https://phabricator.wikimedia.org/T310776) (owner: Dzahn)
[22:59:10] <mutante>	 !log new Wikipedia languages added to DNS:  blk = https://en.wikipedia.org/wiki/Pa%27O_language  |  pcm = https://en.wikipedia.org/wiki/Nigerian_Pidgin
[22:59:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:03:50] <wikibugs>	 SRE, ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (RobH) a:RobH→ayounsi @ayounsi,  So as you can see they advised they want us to go and investigate the cross-connect, and if they result in charges we'll use that thread to get a credit on our Arelion...
[23:04:16] <wikibugs>	 (PS2) Tim Starling: Fix unsupported $wgLogos default configurations [mediawiki-config] - https://gerrit.wikimedia.org/r/806068 (https://phabricator.wikimedia.org/T310767) (owner: Thiemo Kreuz (WMDE))
[23:08:34] <icinga-wm>	 PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:14:16] <wikibugs>	 (CR) Tim Starling: "Really needs a +1 from Tyler." [mediawiki-config] - https://gerrit.wikimedia.org/r/806068 (https://phabricator.wikimedia.org/T310767) (owner: Thiemo Kreuz (WMDE))
[23:18:51] <wikibugs>	 SRE, Data-Engineering-Icebox, Traffic-Icebox: We are not capturing IPs of original requests for proxied requests from operamini and googleweblight. x-forwarded-for is null and client-ip is the same as IP on Webrequest data - https://phabricator.wikimedia.org/T232795 (odimitrijevic)
[23:34:42] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:36:45] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1016.eqiad.wmnet with OS bullseye
[23:36:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:36:52] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host aqs1016.eqiad.wmnet with OS bullseye
[23:38:03] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1016.eqiad.wmnet with reason: host reimage
[23:38:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:41:16] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1016.eqiad.wmnet with reason: host reimage
[23:41:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:53:22] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1016.eqiad.wmnet with OS bullseye
[23:53:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:53:27] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host aqs1016.eqiad.wmnet with OS bullseye completed: - aqs1016 (**WARN**)...