[00:13:37] <icinga-wm>	 PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2022-06-07 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:19:05] <icinga-wm>	 RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:20:41] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[00:25:45] <icinga-wm>	 PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2022-06-07 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:25:45] <icinga-wm>	 PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:36:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:41:40] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:42:59] <icinga-wm>	 PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:05] <icinga-wm>	 RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:00:01] <icinga-wm>	 RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:04:38] <wikibugs>	 (PS1) Krinkle: MessageCache: Increase the MapCacheLRU size [core] (wmf/1.39.0-wmf.16) - https://gerrit.wikimedia.org/r/805435 (https://phabricator.wikimedia.org/T310532)
[01:04:56] <wikibugs>	 (PS1) Krinkle: MessageCache: Increase the MapCacheLRU size [core] (wmf/1.39.0-wmf.15) - https://gerrit.wikimedia.org/r/805436 (https://phabricator.wikimedia.org/T310532)
[01:05:41] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:06:57] <icinga-wm>	 PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:06:59] <icinga-wm>	 PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:31:07] <icinga-wm>	 RECOVERY - Check systemd state on maps2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:34:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:38:05] <icinga-wm>	 PROBLEM - Check systemd state on maps2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. An error occurred trying to list the failed units https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:40:59] <icinga-wm>	 PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:44:05] <icinga-wm>	 RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:47:35] <icinga-wm>	 RECOVERY - dump of es4 in eqiad on backupmon1001 is OK: Last dump for es4 at eqiad (es1022) taken on 2022-06-14 00:00:01 (3153 GiB, +0.9 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:52:22] <wikibugs>	 (CR) Tim Starling: [C: +2] MessageCache: Increase the MapCacheLRU size [core] (wmf/1.39.0-wmf.15) - https://gerrit.wikimedia.org/r/805436 (https://phabricator.wikimedia.org/T310532) (owner: Krinkle)
[01:52:28] <wikibugs>	 (CR) Tim Starling: [C: +2] MessageCache: Increase the MapCacheLRU size [core] (wmf/1.39.0-wmf.16) - https://gerrit.wikimedia.org/r/805435 (https://phabricator.wikimedia.org/T310532) (owner: Krinkle)
[01:57:59] <icinga-wm>	 PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:58:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:58:41] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:59:37] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:59:39] <icinga-wm>	 RECOVERY - dump of es5 in codfw on backupmon1001 is OK: Last dump for es5 at codfw (es2025) taken on 2022-06-14 00:00:02 (3132 GiB, +0.9 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[02:00:51] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.300 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:01:45] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48249 bytes in 0.124 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:09:17] <wikibugs>	 (Merged) jenkins-bot: MessageCache: Increase the MapCacheLRU size [core] (wmf/1.39.0-wmf.15) - https://gerrit.wikimedia.org/r/805436 (https://phabricator.wikimedia.org/T310532) (owner: Krinkle)
[02:09:49] <wikibugs>	 (Merged) jenkins-bot: MessageCache: Increase the MapCacheLRU size [core] (wmf/1.39.0-wmf.16) - https://gerrit.wikimedia.org/r/805435 (https://phabricator.wikimedia.org/T310532) (owner: Krinkle)
[02:11:39] <icinga-wm>	 PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:12:10] <jinxer-wm>	 (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[02:12:10] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[02:13:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:17:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:17:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:17:24] <logmsgbot>	 !log tstarling@deploy1002 Synchronized php-1.39.0-wmf.15/includes/cache/MessageCache.php: T310532 (duration: 03m 29s)
[02:17:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:17:29] <stashbot>	 T310532: Investigate McRouter GET request spike from wmf.15 - https://phabricator.wikimedia.org/T310532
[02:19:45] <icinga-wm>	 PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:21:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:21:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:21:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:21:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:24:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:24:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:25:37] <logmsgbot>	 !log tstarling@deploy1002 Synchronized php-1.39.0-wmf.16/includes/cache/MessageCache.php: (no justification provided) (duration: 03m 36s)
[02:25:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:29:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:30:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:30:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:31:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:31:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:45:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:48:05] <icinga-wm>	 RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:52:00] <icinga-wm>	 PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:55:57] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:01:55] <icinga-wm>	 PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:07:37] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:14:03] <icinga-wm>	 RECOVERY - Check systemd state on maps1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:16:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[03:18:13] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[03:19:35] <icinga-wm>	 PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:20:59] <icinga-wm>	 PROBLEM - Check systemd state on maps1005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:29:05] <icinga-wm>	 RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:32:03] <icinga-wm>	 RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:38:59] <icinga-wm>	 PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:42:59] <icinga-wm>	 PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:45:05] <icinga-wm>	 RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[03:45:39] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:46:01] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[04:04:01] <icinga-wm>	 RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:10:45] <icinga-wm>	 PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:11:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:17:15] <icinga-wm>	 RECOVERY - Check systemd state on maps1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:20:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:20:41] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:22:01] <icinga-wm>	 RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:24:03] <icinga-wm>	 PROBLEM - Check systemd state on maps1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. An error occurred trying to list the failed units https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:25:51] <icinga-wm>	 PROBLEM - Check systemd state on centrallog1001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:28:45] <icinga-wm>	 PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:31:01] <icinga-wm>	 RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:35:01] <icinga-wm>	 RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:37:53] <icinga-wm>	 PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:39:07] <icinga-wm>	 RECOVERY - Check systemd state on maps1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:41:57] <icinga-wm>	 PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:45:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:51:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:53:19] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1173.eqiad.wmnet with OS bullseye
[04:53:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:59:47] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:00:01] <icinga-wm>	 PROBLEM - Check systemd state on maps1010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:01:05] <icinga-wm>	 RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:03:17] <marostegui>	 !log Reboot dbproxy1016 and dbproxy1021 T310484
[05:03:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:21] <stashbot>	 T310484: Reboot dbproxy for kernel upgrades - https://phabricator.wikimedia.org/T310484
[05:04:20] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1173.eqiad.wmnet with reason: host reimage
[05:04:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:05:41] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:06:30] <wikibugs>	 SRE, DBA, Security: Reboot dbproxy for kernel upgrades - https://phabricator.wikimedia.org/T310484 (Marostegui)
[05:07:05] <wikibugs>	 SRE, DBA, Security: Reboot dbproxy for kernel upgrades - https://phabricator.wikimedia.org/T310484 (Marostegui) Open→Resolved All done ` ===== NODE GROUP ===== (12) dbproxy[2001-2004].codfw.wmnet,dbproxy[1012-1017,1020-1021].eqiad.wmnet ----- OUTPUT of 'sudo uname -v' ----- #1 SMP Debian 5.10...
[05:07:25] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1173.eqiad.wmnet with reason: host reimage
[05:07:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:36] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:11:42] <wikibugs>	 (PS1) Marostegui: es2*: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/805503 (https://phabricator.wikimedia.org/T310485)
[05:12:06] <icinga-wm>	 PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:12:47] <wikibugs>	 (CR) Marostegui: [C: +2] es2*: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/805503 (https://phabricator.wikimedia.org/T310485) (owner: Marostegui)
[05:14:06] <wikibugs>	 (PS1) Marostegui: pc1011: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/805504 (https://phabricator.wikimedia.org/T310485)
[05:14:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:15:31] <wikibugs>	 (CR) Marostegui: [C: +2] pc1011: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/805504 (https://phabricator.wikimedia.org/T310485) (owner: Marostegui)
[05:16:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:17:10] <marostegui>	 !log dbmaint es1@codfw T310485
[05:17:12] <marostegui>	 !log dbmaint es2@codfw T310485
[05:17:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:14] <marostegui>	 !log dbmaint es3@codfw T310485
[05:17:15] <marostegui>	 !log dbmaint es4@codfw T310485
[05:17:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:17] <marostegui>	 !log dbmaint es5@codfw T310485
[05:17:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:18:56] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:19:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:21:36] <wikibugs>	 SRE, ops-eqiad, DBA: db1173 won't boot up - https://phabricator.wikimedia.org/T310595 (Marostegui) p:High→Medium Thank you so much @Cmjohnson! I can indeed access the host now and I have reimaged it successfully. Decreasing the priority since the initial issue was triaged (so fast!). So once t...
[05:23:48] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1173.eqiad.wmnet with OS bullseye
[05:23:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:24] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[05:34:25] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[05:34:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:42:46] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[05:42:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:42:48] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[05:42:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:42:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T302659)', diff saved to https://phabricator.wikimedia.org/P29745 and previous config saved to /var/cache/conftool/dbconfig/20220615-054252-marostegui.json
[05:42:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:42:57] <stashbot>	 T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659
[05:49:10] <icinga-wm>	 RECOVERY - Check systemd state on maps1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:50:12] <icinga-wm>	 RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:53:36] <icinga-wm>	 PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:59:02] <icinga-wm>	 RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:00:30] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s8 on db2084 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2079.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2079.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:00:35] <marostegui>	 ^ me
[06:00:52] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s8 on db2080 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2079.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2079.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:00:56] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s8 on db2152 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2079.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2079.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:01:44] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s8 on db2084 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:01:47] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[06:01:48] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[06:01:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T310011)', diff saved to https://phabricator.wikimedia.org/P29746 and previous config saved to /var/cache/conftool/dbconfig/20220615-060153-marostegui.json
[06:01:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:57] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[06:02:12] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s8 on db2152 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:02:35] <marostegui>	 !log Reboot db[2071-2078] T310485
[06:02:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:46] <icinga-wm>	 PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:03:32] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s8 on db2080 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:04:12] <icinga-wm>	 RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:06:04] <icinga-wm>	 RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:09:09] <wikibugs>	 (PS1) Marostegui: Revert "pc1011: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/805437
[06:10:06] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "pc1011: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/805437 (owner: Marostegui)
[06:10:32] <icinga-wm>	 PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:11:00] <icinga-wm>	 PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:12:10] <jinxer-wm>	 (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[06:12:10] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[06:12:16] <icinga-wm>	 PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:12:27] <wikibugs>	 (PS1) Marostegui: Revert "es2*: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/805438
[06:13:30] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "es2*: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/805438 (owner: Marostegui)
[06:14:02] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:14:06] <icinga-wm>	 RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:22:43] <icinga-wm>	 PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:23:05] <icinga-wm>	 RECOVERY - Check systemd state on maps1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:23:55] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:24:09] <icinga-wm>	 PROBLEM - Check systemd state on maps1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. An error occured trying to list the failed units https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:26:23] <icinga-wm>	 PROBLEM - Check systemd state on maps1005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:27:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:30:51] <icinga-wm>	 PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:34:11] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:38:05] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:38:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T310011)', diff saved to https://phabricator.wikimedia.org/P29747 and previous config saved to /var/cache/conftool/dbconfig/20220615-063837-marostegui.json
[06:38:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:44] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[06:42:05] <icinga-wm>	 RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:52:46] <icinga-wm>	 PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:52:58] <XioNoX>	 !log disable BGP to Telia in eqsin for optic replacement - T300485
[06:53:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:03] <stashbot>	 T300485: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485
[06:53:28] <jynus>	 a prometheus job will complain temporarily while I reboot the bacula director
[06:53:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P29748 and previous config saved to /var/cache/conftool/dbconfig/20220615-065342-marostegui.json
[06:53:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:57] <jynus>	 bacula metrics should be back up
[06:57:08] <icinga-wm>	 RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220615T0700). Please do the needful.
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:05:28] <icinga-wm>	 PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:07:06] <icinga-wm>	 RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:08:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P29749 and previous config saved to /var/cache/conftool/dbconfig/20220615-070847-marostegui.json
[07:08:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:58] <icinga-wm>	 PROBLEM - Check systemd state on webperf2002 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_compress_logs.timer,excimer-k8s-log.service,excimer-k8s-wall-log.service,excimer-wall-log.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:11:12] <icinga-wm>	 RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:12:24] <icinga-wm>	 PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:16:10] <icinga-wm>	 RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:17:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P29750 and previous config saved to /var/cache/conftool/dbconfig/20220615-071728-root.json
[07:17:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1148 (re)pooling @ 5%: After schema change', diff saved to https://phabricator.wikimedia.org/P29751 and previous config saved to /var/cache/conftool/dbconfig/20220615-072034-root.json
[07:20:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T310011)', diff saved to https://phabricator.wikimedia.org/P29752 and previous config saved to /var/cache/conftool/dbconfig/20220615-072352-marostegui.json
[07:23:54] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[07:23:56] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[07:23:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:58] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[07:24:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:32] <icinga-wm>	 PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:31:25] <wikibugs>	 (PS3) Slyngshede: Update email address for goransm. [puppet] - https://gerrit.wikimedia.org/r/805389 (https://phabricator.wikimedia.org/T310055)
[07:32:02] <icinga-wm>	 RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:32:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P29753 and previous config saved to /var/cache/conftool/dbconfig/20220615-073232-root.json
[07:32:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:34:30] <icinga-wm>	 PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:35:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1148 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P29754 and previous config saved to /var/cache/conftool/dbconfig/20220615-073538-root.json
[07:35:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:32] <icinga-wm>	 PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:39:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:39:11] <wikibugs>	 (PS2) KartikMistry: testwiki: Enable SectionTranslation for 11 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/805370 (https://phabricator.wikimedia.org/T309384)
[07:43:12] <icinga-wm>	 RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:43:19] <wikibugs>	 SRE, ops-codfw: Degraded RAID on aqs2005 - https://phabricator.wikimedia.org/T310610 (SLyngshede-WMF) p:Triage→Low
[07:44:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:44:08] <icinga-wm>	 RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:44:22] <wikibugs>	 SRE, ops-codfw: Degraded RAID on aqs2005 - https://phabricator.wikimedia.org/T310610 (SLyngshede-WMF) p:Low→Medium
[07:46:10] <wikibugs>	 SRE, ops-eqiad: cloudstore1008 - eno2 reporting no carrier - https://phabricator.wikimedia.org/T309885 (SLyngshede-WMF) p:Triage→Medium
[07:47:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P29755 and previous config saved to /var/cache/conftool/dbconfig/20220615-074736-root.json
[07:47:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:02] <icinga-wm>	 RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:49:13] <wikibugs>	 (CR) Slyngshede: [C: +2] Update email address for goransm. [puppet] - https://gerrit.wikimedia.org/r/805389 (https://phabricator.wikimedia.org/T310055) (owner: Slyngshede)
[07:49:36] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2022-06-15-074244-production [deployment-charts] - https://gerrit.wikimedia.org/r/805726 (https://phabricator.wikimedia.org/T309266)
[07:50:19] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[07:50:20] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[07:50:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T310011)', diff saved to https://phabricator.wikimedia.org/P29756 and previous config saved to /var/cache/conftool/dbconfig/20220615-075024-marostegui.json
[07:50:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:28] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[07:50:42] <icinga-wm>	 PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:50:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1148 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P29757 and previous config saved to /var/cache/conftool/dbconfig/20220615-075042-root.json
[07:50:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:38] <wikibugs>	 (PS3) Slyngshede: Allow deployers to sudo -u mwpresync [puppet] - https://gerrit.wikimedia.org/r/805464 (https://phabricator.wikimedia.org/T310654) (owner: Ahmon Dancy)
[07:53:06] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/805464 (https://phabricator.wikimedia.org/T310654) (owner: Ahmon Dancy)
[07:53:48] <wikibugs>	 (CR) Slyngshede: [C: +2] Allow deployers to sudo -u mwpresync [puppet] - https://gerrit.wikimedia.org/r/805464 (https://phabricator.wikimedia.org/T310654) (owner: Ahmon Dancy)
[07:54:34] <icinga-wm>	 PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:54:41] <wikibugs>	 SRE, SRE-Access-Requests, Infrastructure-Foundations, Patch-For-Review, Release-Engineering-Team (Deployment Autopilot 🛩️): Allow deployers to sudo -u mwpresync - https://phabricator.wikimedia.org/T310654 (SLyngshede-WMF) Open→Resolved p:Triage→High a:SLyngshede-WMF
[07:57:10] <icinga-wm>	 RECOVERY - Check systemd state on maps1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:57:41] <wikibugs>	 SRE, serviceops, Patch-For-Review: Update conf1* servers - https://phabricator.wikimedia.org/T310062 (SLyngshede-WMF) p:Triage→Medium
[07:58:04] <wikibugs>	 SRE, MediaWiki-General, Traffic: Query canonicalization for MediaWiki - https://phabricator.wikimedia.org/T310087 (SLyngshede-WMF) p:Triage→Medium
[07:59:04] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 244 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:00:39] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/805468 (https://phabricator.wikimedia.org/T296452) (owner: Ayounsi)
[08:01:28] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:02:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[08:02:26] <wikibugs>	 (PS1) Muehlenhoff: Retire profile::logster_alarm [puppet] - https://gerrit.wikimedia.org/r/805733
[08:02:28] <wikibugs>	 (PS1) Muehlenhoff: Retire profile::logster_alarm [puppet] - https://gerrit.wikimedia.org/r/805734
[08:02:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P29758 and previous config saved to /var/cache/conftool/dbconfig/20220615-080240-root.json
[08:02:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:27] <XioNoX>	 !log re-enable BGP to Telia in eqsin for optic replacement - T300485
[08:03:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:30] <stashbot>	 T300485: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485
[08:03:44] <icinga-wm>	 PROBLEM - Check systemd state on maps1010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:05:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1148 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P29759 and previous config saved to /var/cache/conftool/dbconfig/20220615-080546-root.json
[08:05:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:06] <icinga-wm>	 RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:07:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[08:09:01] <wikibugs>	 (CR) Ayounsi: [V: +1 C: +2] Netbox: only run CSV dumps on active server [puppet] - https://gerrit.wikimedia.org/r/805468 (https://phabricator.wikimedia.org/T296452) (owner: Ayounsi)
[08:09:17] <wikibugs>	 (CR) Slyngshede: [C: +1] "LGTM, unless we want to ensure that the redundant files are removed first." [puppet] - https://gerrit.wikimedia.org/r/805734 (owner: Muehlenhoff)
[08:10:44] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (26) node(s) change every puppet run: an-tool1009, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudservices1003, cloudservices1004, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, netboxdb2001, netboxdb2002, puppetdb2002, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[08:12:24] <wikibugs>	 (CR) Slyngshede: [C: +1] "Looks good, but I think we can just remove the custom parser as well." [puppet] - https://gerrit.wikimedia.org/r/805733 (owner: Muehlenhoff)
[08:12:42] <icinga-wm>	 PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:14:02] <icinga-wm>	 RECOVERY - Check systemd state on dumpsdata1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:17:24] <wikibugs>	 (PS1) Awight: Remove $wgVisualEditorTransclusionDialogBackButton feature flag [extensions/Cite] (wmf/1.39.0-wmf.16) - https://gerrit.wikimedia.org/r/805748 (https://phabricator.wikimedia.org/T307188)
[08:17:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P29760 and previous config saved to /var/cache/conftool/dbconfig/20220615-081744-root.json
[08:17:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:02] <icinga-wm>	 RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:19:00] <wikibugs>	 (PS1) Jaime Nuche: scap: remove config for scap Debian package [puppet] - https://gerrit.wikimedia.org/r/805736 (https://phabricator.wikimedia.org/T303559)
[08:20:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[08:20:38] <icinga-wm>	 PROBLEM - Check systemd state on dumpsdata1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rasdaemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:20:41] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:20:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1148 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P29761 and previous config saved to /var/cache/conftool/dbconfig/20220615-082050-root.json
[08:20:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:04] <wikibugs>	 (PS12) Elukey: Add new Cassandra cluster for ML cache/feature-store workloads in eqiad [puppet] - https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232)
[08:21:18] <wikibugs>	 (CR) Elukey: Add new Cassandra cluster for ML cache/feature-store workloads in eqiad (1 comment) [puppet] - https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232) (owner: Elukey)
[08:22:14] <logmsgbot>	 !log jnuche@deploy1002 Installing scap version "4.9.3" for 557 hosts
[08:22:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:34] <logmsgbot>	 !log jnuche@deploy1002 Installation of scap version "4.9.3" completed for 557 hosts
[08:22:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:47] <logmsgbot>	 !log jnuche@deploy1002 Installing scap version "4.9.3" for 557 hosts
[08:22:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:07] <logmsgbot>	 !log jnuche@deploy1002 Installation of scap version "4.9.3" completed for 557 hosts
[08:23:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:38] <icinga-wm>	 PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:26:04] <icinga-wm>	 RECOVERY - Check systemd state on maps1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:26:52] <wikibugs>	 (CR) Muehlenhoff: Retire profile::logster_alarm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/805733 (owner: Muehlenhoff)
[08:27:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T310011)', diff saved to https://phabricator.wikimedia.org/P29762 and previous config saved to /var/cache/conftool/dbconfig/20220615-082734-marostegui.json
[08:27:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:39] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[08:28:04] <icinga-wm>	 RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:29:02] <icinga-wm>	 RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:30:32] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): [C: +1] "I can confirm. 👍" [mediawiki-config] - https://gerrit.wikimedia.org/r/805433 (owner: DannyS712)
[08:32:42] <icinga-wm>	 PROBLEM - Check systemd state on maps1005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:34:06] <icinga-wm>	 RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:34:40] <icinga-wm>	 PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:35:13] <wikibugs>	 SRE, serviceops, Patch-For-Review: Update conf1* servers - https://phabricator.wikimedia.org/T310062 (SLyngshede-WMF) p:Medium→High
[08:35:38] <icinga-wm>	 PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:35:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1148 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P29763 and previous config saved to /var/cache/conftool/dbconfig/20220615-083554-root.json
[08:35:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:04] <wikibugs>	 SRE, Thumbor, Traffic: Thumbor URLs are too permissive - https://phabricator.wikimedia.org/T310528 (SLyngshede-WMF) p:Triage→Medium
[08:38:14] <wikibugs>	 SRE, Traffic, serviceops: fawiki user reports getting 503 errors with message "upstream connect error or disconnect before headers" - https://phabricator.wikimedia.org/T310450 (SLyngshede-WMF) p:Triage→Medium
[08:39:04] <icinga-wm>	 RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:40:21] <wikibugs>	 (CR) Volans: [C: -1] "Approach looks good, some minor things to fix inline." [cookbooks] - https://gerrit.wikimedia.org/r/792251 (owner: Jbond)
[08:40:40] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[08:40:42] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[08:40:42] <icinga-wm>	 PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:40:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T302659)', diff saved to https://phabricator.wikimedia.org/P29764 and previous config saved to /var/cache/conftool/dbconfig/20220615-084046-marostegui.json
[08:40:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:51] <stashbot>	 T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659
[08:41:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T302659)', diff saved to https://phabricator.wikimedia.org/P29765 and previous config saved to /var/cache/conftool/dbconfig/20220615-084151-marostegui.json
[08:41:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:12] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): "Personally, I find this one of the more questionable rules we have. The benefit is small, especially when I use an IDE that clearly separa" [mediawiki-config] - https://gerrit.wikimedia.org/r/805432 (https://phabricator.wikimedia.org/T171115) (owner: DannyS712)
[08:42:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P29766 and previous config saved to /var/cache/conftool/dbconfig/20220615-084239-marostegui.json
[08:42:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:50] <wikibugs>	 (PS1) Ayounsi: Netbox: remove Icinga checks for netbox.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/805740 (https://phabricator.wikimedia.org/T296452)
[08:45:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[08:45:40] <icinga-wm>	 PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:47:54] <wikibugs>	 (PS2) David Caro: wmcs: move alerting code to a library [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805376 (https://phabricator.wikimedia.org/T309786)
[08:47:56] <wikibugs>	 (PS2) David Caro: wmcs.ceph.upgrade*: add sal logs [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805377 (https://phabricator.wikimedia.org/T309786)
[08:47:58] <wikibugs>	 (PS1) David Caro: wmcs.ceph: move core code to a library [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805741 (https://phabricator.wikimedia.org/T309786)
[08:48:03] <wikibugs>	 (PS1) David Caro: wmcs.alert/ceph: allow downtiming alerts [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805742
[08:50:04] <icinga-wm>	 RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:50:51] <wikibugs>	 SRE, serviceops: Requesting SSH keypair for deployment server keyholder to push to Gerrit - https://phabricator.wikimedia.org/T310620 (SLyngshede-WMF) p:Triage→Medium
[08:50:55] <wikibugs>	 (Abandoned) Thiemo Kreuz (WMDE): Remove $wgVisualEditorTransclusionDialogBackButton feature flag [extensions/Cite] (wmf/1.39.0-wmf.16) - https://gerrit.wikimedia.org/r/805748 (https://phabricator.wikimedia.org/T307188) (owner: Awight)
[08:51:02] <icinga-wm>	 RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:51:41] <wikibugs>	 SRE, serviceops: Requesting SSH keypair for deployment server keyholder to push to Gerrit - https://phabricator.wikimedia.org/T310620 (SLyngshede-WMF) This doesn't appear to be an SRE-Access-Request. Adding the ServiceOps tags, as they are involved in the Kubernetes migration and it makes sense to loop t...
[08:51:51] <wikibugs>	 SRE, SRE-OnFire, Traffic, Sustainability (Incident Followup): (Re) evaluate effectiveness / usefulness of varnish/haproxy traffic drop alerts - https://phabricator.wikimedia.org/T310608 (SLyngshede-WMF) p:Triage→Medium
[08:53:22] <wikibugs>	 (CR) CI reject: [V: -1] wmcs.ceph: move core code to a library [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805741 (https://phabricator.wikimedia.org/T309786) (owner: David Caro)
[08:53:26] <wikibugs>	 (CR) CI reject: [V: -1] wmcs.alert/ceph: allow downtiming alerts [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805742 (owner: David Caro)
[08:56:01] <wikibugs>	 SRE, SRE-Access-Requests, LDAP-Access-Requests: Check access rights for GoranSMilovanovic - https://phabricator.wikimedia.org/T310055 (SLyngshede-WMF) Open→Resolved p:Triage→Medium Email address is updated, everything else looks fine.
[08:56:04] <icinga-wm>	 RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:56:40] <icinga-wm>	 PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:56:55] <jinxer-wm>	 (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[08:56:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P29767 and previous config saved to /var/cache/conftool/dbconfig/20220615-085656-marostegui.json
[08:56:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:14] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] "Gonna -1 this one for now to avoid accidental merging while some clarifications and communications are ongoing regarding this one." [puppet] - https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) (owner: Brennen Bearnes)
[08:57:38] <icinga-wm>	 PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:57:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P29768 and previous config saved to /var/cache/conftool/dbconfig/20220615-085744-marostegui.json
[08:57:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:08] <icinga-wm>	 RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:01:55] <jinxer-wm>	 (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[09:02:46] <icinga-wm>	 PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:03:08] <icinga-wm>	 PROBLEM - Host ms-be1059 is DOWN: PING CRITICAL - Packet loss = 100%
[09:05:19] <wikibugs>	 SRE, ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (ayounsi) We tried: * New optic * New patch cable * New router port  And the errors are still present. Next step is to follow up with Telia so they check their side and then the DC for the X-connect.
[09:05:41] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[09:06:55] <jinxer-wm>	 (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[09:07:25] <jinxer-wm>	 (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[09:07:48] <icinga-wm>	 PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:08:15] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for Xcollazo - https://phabricator.wikimedia.org/T310385 (SLyngshede-WMF) Open→Resolved
[09:09:48] <icinga-wm>	 RECOVERY - Host ms-be1059 is UP: PING OK - Packet loss = 0%, RTA = 0.15 ms
[09:10:39] <wikibugs>	 (CR) Awight: "Cherry-pick for backport." [extensions/VisualEditor] (wmf/1.39.0-wmf.16) - https://gerrit.wikimedia.org/r/805745 (https://phabricator.wikimedia.org/T310602) (owner: Awight)
[09:12:01] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): [C: +1] Restore internal mechanism to use either back or close button [extensions/VisualEditor] (wmf/1.39.0-wmf.16) - https://gerrit.wikimedia.org/r/805745 (https://phabricator.wikimedia.org/T310602) (owner: Awight)
[09:12:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P29769 and previous config saved to /var/cache/conftool/dbconfig/20220615-091201-marostegui.json
[09:12:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:10] <jinxer-wm>	 (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[09:12:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T310011)', diff saved to https://phabricator.wikimedia.org/P29770 and previous config saved to /var/cache/conftool/dbconfig/20220615-091249-marostegui.json
[09:12:50] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Spicerack, and 2 others: Spicerack cookbooks TODO list - https://phabricator.wikimedia.org/T203943 (joanna_borun)
[09:12:51] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[09:12:53] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1119.eqiad.wmnet with reason: Maintenance
[09:12:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:54] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[09:12:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T310011)', diff saved to https://phabricator.wikimedia.org/P29771 and previous config saved to /var/cache/conftool/dbconfig/20220615-091257-marostegui.json
[09:12:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:06] <icinga-wm>	 RECOVERY - Check systemd state on maps2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:13:29] <wikibugs>	 (CR) Jaime Nuche: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/805736 (https://phabricator.wikimedia.org/T303559) (owner: Jaime Nuche)
[09:14:56] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1059.eqiad.wmnet with OS bullseye
[09:14:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:00] <wikibugs>	 (PS6) Klausman: service::catalog: Add inference-staging service [puppet] - https://gerrit.wikimedia.org/r/805329 (https://phabricator.wikimedia.org/T302195)
[09:15:04] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1059.eqiad.wmnet with OS bullseye
[09:15:08] <wikibugs>	 (PS8) Jbond: sre.host.pxe: Cookbook to configure dhcp option82 and reboot into pxe [cookbooks] - https://gerrit.wikimedia.org/r/792251
[09:16:55] <elukey>	 vgutierrez: ack enjoy :)
[09:17:10] <jinxer-wm>	 (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[09:17:43] <wikibugs>	 (PS2) David Caro: wmcs.ceph: move core code to a library [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805741 (https://phabricator.wikimedia.org/T309786)
[09:17:45] <wikibugs>	 (PS2) David Caro: wmcs.alert/ceph: allow downtiming alerts [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805742
[09:17:47] <wikibugs>	 (CR) Jbond: "update thanks" [cookbooks] - https://gerrit.wikimedia.org/r/792251 (owner: Jbond)
[09:18:03] <wikibugs>	 (PS1) Mainframe98: Fix deletion of translation pages outside of NS_MAIN namespace [extensions/Translate] (wmf/1.39.0-wmf.16) - https://gerrit.wikimedia.org/r/805749 (https://phabricator.wikimedia.org/T310440)
[09:19:40] <icinga-wm>	 PROBLEM - Check systemd state on maps2005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:20:32] <marostegui>	 !log Reboot sanitarium hosts (db1154, db1155) wiki replicas will have lag
[09:20:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:10] <jinxer-wm>	 (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[09:22:50] <wikibugs>	 (CR) CI reject: [V: -1] wmcs.alert/ceph: allow downtiming alerts [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805742 (owner: David Caro)
[09:23:38] <wikibugs>	 (CR) CI reject: [V: -1] wmcs.ceph: move core code to a library [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/805741 (https://phabricator.wikimedia.org/T309786) (owner: David Caro)
[09:24:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:27:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T302659)', diff saved to https://phabricator.wikimedia.org/P29772 and previous config saved to /var/cache/conftool/dbconfig/20220615-092706-marostegui.json
[09:27:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:12] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2121.codfw.wmnet with reason: Maintenance
[09:27:12] <stashbot>	 T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659
[09:27:13] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2121.codfw.wmnet with reason: Maintenance
[09:27:14] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 10 hosts with reason: Maintenance
[09:27:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:22] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 10 hosts with reason: Maintenance
[09:27:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:00] <wikibugs>	 (CR) Volans: [C: -1] "In general LGTM, couple of minor comments in line." [cookbooks] - https://gerrit.wikimedia.org/r/802811 (owner: JMeybohm)
[09:29:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:32:34] <wikibugs>	 SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (MatthewVernon)
[09:32:48] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (MatthewVernon) Resolved→Open Hi,  Sorry to reopen this task, but there's a licence issue, I think. When I try and use the HTML5 console, it says: "License Required This iLO is...
[09:39:21] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (25) node(s) change every puppet run: aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudservices1003, cloudservices1004, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, netboxdb2001, netboxdb2002, puppetdb2002, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe20
[09:39:21] <icinga-wm>	 os-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[09:40:41] <wikibugs>	 (PS1) Kosta Harlan: GrowthExperiments: Enable link recommendation on aswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/805766
[09:44:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4001.ulsfo.wmnet
[09:44:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:31] <wikibugs>	 (PS1) Marostegui: change_localuser.lu_attached_timestamp_T302659.py: Replaced the check [software/schema-changes] - https://gerrit.wikimedia.org/r/805767 (https://phabricator.wikimedia.org/T302659)
[09:44:50] <wikibugs>	 (PS2) Marostegui: change_localuser.lu_attached_timestamp_T302659.py: Replace the check [software/schema-changes] - https://gerrit.wikimedia.org/r/805767 (https://phabricator.wikimedia.org/T302659)
[09:45:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T310011)', diff saved to https://phabricator.wikimedia.org/P29773 and previous config saved to /var/cache/conftool/dbconfig/20220615-094532-marostegui.json
[09:45:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:38] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[09:46:57] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:47:56] <wikibugs>	 (CR) Marostegui: [C: +2] change_localuser.lu_attached_timestamp_T302659.py: Replace the check [software/schema-changes] - https://gerrit.wikimedia.org/r/805767 (https://phabricator.wikimedia.org/T302659) (owner: Marostegui)
[09:48:03] <icinga-wm>	 RECOVERY - Check systemd state on maps1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:48:52] <wikibugs>	 (Merged) jenkins-bot: change_localuser.lu_attached_timestamp_T302659.py: Replace the check [software/schema-changes] - https://gerrit.wikimedia.org/r/805767 (https://phabricator.wikimedia.org/T302659) (owner: Marostegui)
[09:49:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4001.ulsfo.wmnet
[09:50:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:52:07] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:52:20] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/805740 (https://phabricator.wikimedia.org/T296452) (owner: Ayounsi)
[09:52:41] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:54:25] <icinga-wm>	 PROBLEM - Check systemd state on maps1008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:57:37] <icinga-wm>	 PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:58:43] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:59:02] <wikibugs>	 (CR) Volans: [C: -1] "Minor comments inline, one needs to be fixed, would not work as is." [cookbooks] - https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: JMeybohm)
[09:59:14] <wikibugs>	 (CR) Volans: [C: -1] "clarified comment" [cookbooks] - https://gerrit.wikimedia.org/r/802811 (owner: JMeybohm)
[10:00:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P29774 and previous config saved to /var/cache/conftool/dbconfig/20220615-100037-marostegui.json
[10:00:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:53] <godog>	 hnowlan: are the maps criticals for prometheus-pg-replication-lag.service known? seems to be flip/flopping
[10:02:29] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[10:02:30] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[10:02:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T302659)', diff saved to https://phabricator.wikimedia.org/P29775 and previous config saved to /var/cache/conftool/dbconfig/20220615-100235-marostegui.json
[10:02:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:40] <stashbot>	 T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659
[10:03:44] <hnowlan>	 godog: a check was broken yesterday, there's a fix in review - will try to move it along
[10:03:55] <wikibugs>	 (CR) Volans: "One missed thing I think, then LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/792251 (owner: Jbond)
[10:05:13] <godog>	 hnowlan: ah! ack, thanks
[10:06:27] <wikibugs>	 (CR) Volans: "clarification comment" [puppet] - https://gerrit.wikimedia.org/r/779531 (https://phabricator.wikimedia.org/T305589) (owner: Dzahn)
[10:07:07] <icinga-wm>	 RECOVERY - Check systemd state on maps2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:08:27] <wikibugs>	 (CR) Volans: "I don't mind it, LGTM. I'm just a little bit worried that this will take people by surprise with the additional diff. We should communicat" [cookbooks] - https://gerrit.wikimedia.org/r/804575 (owner: Jbond)
[10:08:49] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM, thanks!" [puppet] - https://gerrit.wikimedia.org/r/805740 (https://phabricator.wikimedia.org/T296452) (owner: Ayounsi)
[10:12:10] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[10:13:45] <icinga-wm>	 PROBLEM - Check systemd state on maps2005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:15:03] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "I'm adding Ben re: the analytics/data-engineering question and how these alerts should reach which team, HTH!" [alerts] - https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall)
[10:15:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P29776 and previous config saved to /var/cache/conftool/dbconfig/20220615-101543-marostegui.json
[10:15:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[10:21:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:03] <icinga-wm>	 RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:22:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[10:22:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[10:22:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:13] <wikibugs>	 (CR) JMeybohm: [C: -1] "This would leave failing test jobs around forever (requiring manual cleanup)." [deployment-charts] - https://gerrit.wikimedia.org/r/803597 (owner: Ahmon Dancy)
[10:25:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[10:25:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:40] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (MatthewVernon) Hi, Sorry for further noise, but: is it possible there's some USB thing still connected to this system? It won't reimage because of: ` [   12.234765] sd 0:0:0:0: [sda] A...
[10:27:17] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:28:39] <icinga-wm>	 PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:29:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1029 es1030 es1028 for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29777 and previous config saved to /var/cache/conftool/dbconfig/20220615-102929-root.json
[10:29:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T310011)', diff saved to https://phabricator.wikimedia.org/P29778 and previous config saved to /var/cache/conftool/dbconfig/20220615-103048-marostegui.json
[10:30:50] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[10:30:52] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[10:30:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:53] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[10:30:53] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[10:30:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:57] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[10:30:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T310011)', diff saved to https://phabricator.wikimedia.org/P29779 and previous config saved to /var/cache/conftool/dbconfig/20220615-103101-marostegui.json
[10:31:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:07] <icinga-wm>	 RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:34:17] <wikibugs>	 (PS3) Filippo Giunchedi: am: retry on CGI failure or empty output [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/805384 (https://phabricator.wikimedia.org/T310331)
[10:36:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29780 and previous config saved to /var/cache/conftool/dbconfig/20220615-103608-root.json
[10:36:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29781 and previous config saved to /var/cache/conftool/dbconfig/20220615-103615-root.json
[10:36:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:17] <wikibugs>	 (CR) Filippo Giunchedi: "Thank you for the reviews David!" [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/805384 (https://phabricator.wikimedia.org/T310331) (owner: Filippo Giunchedi)
[10:39:22] <wikibugs>	 (CR) JMeybohm: Make SREBatchBase operate on host groups (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/802811 (owner: JMeybohm)
[10:39:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29782 and previous config saved to /var/cache/conftool/dbconfig/20220615-103933-root.json
[10:39:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:51] <icinga-wm>	 PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:42:05] <icinga-wm>	 RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:42:37] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Spicerack, and 2 others: Create a spicerack cookbook for restoring an etcd cluster from backups - https://phabricator.wikimedia.org/T203944 (joanna_borun)
[10:42:41] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Spicerack, User-Joe: Covert deploy_apache_change.sh to a spicerack cookbook - https://phabricator.wikimedia.org/T203948 (joanna_borun)
[10:43:49] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Spicerack, and 4 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (joanna_borun)
[10:43:57] <wikibugs>	 (PS2) Aklapper: Redirect dev.wikimedia.org to developer.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/791321 (https://phabricator.wikimedia.org/T265018)
[10:44:01] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Spicerack, User-Joe: Create a spicerack cookbook to empty a ganeti node from VMs - https://phabricator.wikimedia.org/T203964 (joanna_borun)
[10:44:33] <wikibugs>	 (CR) Jbond: Make SREBatchBase operate on host groups (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/802811 (owner: JMeybohm)
[10:45:37] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[10:45:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:54] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Spicerack, and 4 others: Create a cookbook to copy data between WDQS servers - https://phabricator.wikimedia.org/T213401 (joanna_borun)
[10:46:01] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Prod-Kubernetes, and 2 others: Create Spicerack cookbook to drain/reboot/uncordon a Kubernetes worker - https://phabricator.wikimedia.org/T212866 (joanna_borun)
[10:46:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[10:46:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[10:46:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:40] <wikibugs>	 SRE, SRE-tools, Discovery-Search, Spicerack, and 2 others: Migrate elasticsearch scripts to spicerack cookbooks - https://phabricator.wikimedia.org/T202885 (joanna_borun)
[10:46:58] <wikibugs>	 SRE, SRE-tools, Elasticsearch, Spicerack, and 2 others: Refactor current code base to support multiple elasticsearch instances/multiple elasticsearch clusters - https://phabricator.wikimedia.org/T207918 (joanna_borun)
[10:47:02] <wikibugs>	 (CR) Aklapper: [C: +1] "Please merge." [puppet] - https://gerrit.wikimedia.org/r/791321 (https://phabricator.wikimedia.org/T265018) (owner: Aklapper)
[10:47:22] <wikibugs>	 SRE, SRE-tools, Spicerack, Discovery-Search (Current work), Patch-For-Review: Write cookbooks to support spicerack's elasticsearch multi cluster/instance - https://phabricator.wikimedia.org/T207919 (joanna_borun)
[10:47:35] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Maps, and 3 others: Create cookbook for postgres initialization on maps cluster - https://phabricator.wikimedia.org/T220946 (joanna_borun)
[10:47:39] <wikibugs>	 SRE, SRE-tools, Discovery-Search, Spicerack: Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507 (joanna_borun)
[10:47:53] <wikibugs>	 SRE, SRE-tools, Elasticsearch, Spicerack, and 2 others: Test spicerack elasticsearch module - https://phabricator.wikimedia.org/T207920 (joanna_borun)
[10:47:59] <wikibugs>	 SRE, SRE-tools, Spicerack, Wikidata, and 2 others: Create Cookbook to restart WDQS - https://phabricator.wikimedia.org/T221832 (joanna_borun)
[10:48:15] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Spicerack, User-Joe: Create cookbook to do `nodetool repair` across cassandra cluster - https://phabricator.wikimedia.org/T225694 (joanna_borun)
[10:48:41] <icinga-wm>	 PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:48:44] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Spicerack, and 3 others: Create WDQS reboot cookbook - https://phabricator.wikimedia.org/T224385 (joanna_borun)
[10:48:49] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Maps, and 3 others: Create cookbook to reboot Maps - https://phabricator.wikimedia.org/T224072 (joanna_borun)
[10:48:56] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Spicerack: Create a cookbook to restart the jvms on a Cassandra cluster - https://phabricator.wikimedia.org/T230022 (joanna_borun)
[10:49:02] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Spicerack: More structured cookbooks to reboot hosts - https://phabricator.wikimedia.org/T252807 (joanna_borun)
[10:49:30] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Prod-Kubernetes, and 4 others: Create a cookbook to perform a rolling reboot of a kubernetes cluster - https://phabricator.wikimedia.org/T260661 (joanna_borun)
[10:49:50] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Prod-Kubernetes, and 3 others: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (joanna_borun)
[10:49:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[10:49:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29783 and previous config saved to /var/cache/conftool/dbconfig/20220615-105112-root.json
[10:51:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29784 and previous config saved to /var/cache/conftool/dbconfig/20220615-105119-root.json
[10:51:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:52] <wikibugs>	 (PS1) Btullis: Enable CAS-SSO for hue.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/805773 (https://phabricator.wikimedia.org/T310686)
[10:54:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29786 and previous config saved to /var/cache/conftool/dbconfig/20220615-105437-root.json
[10:54:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:53] <wikibugs>	 (CR) Btullis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/805773 (https://phabricator.wikimedia.org/T310686) (owner: Btullis)
[10:54:57] <marostegui>	 !log dbmaint es1@eqiad T310485
[10:54:59] <marostegui>	 !log dbmaint es2@eqiad T310485
[10:55:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:01] <marostegui>	 !log dbmaint es3@eqiad T310485
[10:55:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:05] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/805773 (https://phabricator.wikimedia.org/T310686) (owner: Btullis)
[10:58:35] <wikibugs>	 (CR) Btullis: [C: +2] Enable CAS-SSO for hue.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/805773 (https://phabricator.wikimedia.org/T310686) (owner: Btullis)
[11:01:07] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:01:09] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to Superset and Tunilo for Caroline Myrick - https://phabricator.wikimedia.org/T310524 (SLyngshede-WMF) p:Triage→High @CMyrick-WMF I have added you to the WMF LDAP group, that should grant you access to Superset and Turnilo.   It does not grant you access t...
[11:01:27] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to Superset and Tunilo for Caroline Myrick - https://phabricator.wikimedia.org/T310524 (SLyngshede-WMF) Open→Resolved a:SLyngshede-WMF
[11:04:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T310011)', diff saved to https://phabricator.wikimedia.org/P29787 and previous config saved to /var/cache/conftool/dbconfig/20220615-110435-marostegui.json
[11:04:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:40] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[11:06:13] <wikibugs>	 (CR) D3r1ck01: Use a service locator to get a job runner (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/793837 (owner: D3r1ck01)
[11:06:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29788 and previous config saved to /var/cache/conftool/dbconfig/20220615-110616-root.json
[11:06:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29789 and previous config saved to /var/cache/conftool/dbconfig/20220615-110623-root.json
[11:06:25] <wikibugs>	 (Abandoned) D3r1ck01: Use a service locator to get a job runner [mediawiki-config] - https://gerrit.wikimedia.org/r/793837 (owner: D3r1ck01)
[11:06:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:47] <icinga-wm>	 PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:09:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29790 and previous config saved to /var/cache/conftool/dbconfig/20220615-110940-root.json
[11:09:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:05] <icinga-wm>	 RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:19:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P29791 and previous config saved to /var/cache/conftool/dbconfig/20220615-111940-marostegui.json
[11:19:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:49] <icinga-wm>	 PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:21:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29792 and previous config saved to /var/cache/conftool/dbconfig/20220615-112119-root.json
[11:21:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29793 and previous config saved to /var/cache/conftool/dbconfig/20220615-112127-root.json
[11:21:30] <wikibugs>	 (CR) Ayounsi: [C: +2] Netbox: remove Icinga checks for netbox.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/805740 (https://phabricator.wikimedia.org/T296452) (owner: Ayounsi)
[11:21:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:21] <wikibugs>	 (CR) D3r1ck01: "We discussed about this yesterday in PET technical discussion and we agreed to make a patch removing this file." [mediawiki-config] - https://gerrit.wikimedia.org/r/805775 (https://phabricator.wikimedia.org/T175146) (owner: D3r1ck01)
[11:23:06] <wikibugs>	 (CR) Muehlenhoff: [C: +2] "Looks good and PCC checks out as well, merging: https://puppet-compiler.wmflabs.org/pcc-worker1003/35854/" [puppet] - https://gerrit.wikimedia.org/r/805736 (https://phabricator.wikimedia.org/T303559) (owner: Jaime Nuche)
[11:23:19] <wikibugs>	 (PS1) Jbond: redfish: update poll task to deal with older models [software/spicerack] - https://gerrit.wikimedia.org/r/805782
[11:24:18] <moritzm>	 XioNoX: shall I merge your Icinga patch along?
[11:24:28] <XioNoX>	 moritzm: yeah I was about to
[11:24:31] <XioNoX>	 thanks
[11:24:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29794 and previous config saved to /var/cache/conftool/dbconfig/20220615-112444-root.json
[11:24:46] <moritzm>	 ack, doing that now
[11:24:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:56] <moritzm>	 merged
[11:25:33] <icinga-wm>	 PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:25:51] <wikibugs>	 (PS2) Hnowlan: P:maps::osm_replica fix prom-replication lag script. [puppet] - https://gerrit.wikimedia.org/r/805381 (owner: Slyngshede)
[11:26:38] <XioNoX>	 thx, checking
[11:26:47] <wikibugs>	 (CR) Slyngshede: [C: +1] "Looks good." [puppet] - https://gerrit.wikimedia.org/r/805381 (owner: Slyngshede)
[11:27:00] <wikibugs>	 (CR) CI reject: [V: -1] P:maps::osm_replica fix prom-replication lag script. [puppet] - https://gerrit.wikimedia.org/r/805381 (owner: Slyngshede)
[11:28:36] <wikibugs>	 (PS3) Hnowlan: P:maps::osm_replica fix prom-replication lag script. [puppet] - https://gerrit.wikimedia.org/r/805381 (owner: Slyngshede)
[11:29:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T302659)', diff saved to https://phabricator.wikimedia.org/P29795 and previous config saved to /var/cache/conftool/dbconfig/20220615-112924-marostegui.json
[11:29:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:30] <stashbot>	 T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659
[11:31:13] <wikibugs>	 (PS1) Majavah: wmcs: neutron: use min_over_time [alerts] - https://gerrit.wikimedia.org/r/805783
[11:34:24] <wikibugs>	 (PS1) Slyngshede: Grant access to analytics_privatedata_users to user ricby [puppet] - https://gerrit.wikimedia.org/r/805785 (https://phabricator.wikimedia.org/T310227)
[11:34:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P29796 and previous config saved to /var/cache/conftool/dbconfig/20220615-113445-marostegui.json
[11:34:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29797 and previous config saved to /var/cache/conftool/dbconfig/20220615-113623-root.json
[11:36:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29798 and previous config saved to /var/cache/conftool/dbconfig/20220615-113631-root.json
[11:36:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:22] <wikibugs>	 (PS4) Hnowlan: P:maps::osm_replica fix prom-replication lag script. [puppet] - https://gerrit.wikimedia.org/r/805381 (owner: Slyngshede)
[11:38:24] <wikibugs>	 (CR) CI reject: [V: -1] redfish: update poll task to deal with older models [software/spicerack] - https://gerrit.wikimedia.org/r/805782 (owner: Jbond)
[11:39:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29799 and previous config saved to /var/cache/conftool/dbconfig/20220615-113948-root.json
[11:39:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:16] <wikibugs>	 (CR) Hnowlan: [V: +1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35859/console" [puppet] - https://gerrit.wikimedia.org/r/805381 (owner: Slyngshede)
[11:41:09] <wikibugs>	 (CR) Slyngshede: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/805381 (owner: Slyngshede)
[11:41:38] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35860/console" [puppet] - https://gerrit.wikimedia.org/r/804800 (owner: Legoktm)
[11:42:52] <wikibugs>	 (CR) Hnowlan: [V: +1 C: +2] P:maps::osm_replica fix prom-replication lag script. [puppet] - https://gerrit.wikimedia.org/r/805381 (owner: Slyngshede)
[11:44:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P29800 and previous config saved to /var/cache/conftool/dbconfig/20220615-114430-marostegui.json
[11:44:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:14] <icinga-wm>	 RECOVERY - Check systemd state on maps2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:45:48] <icinga-wm>	 RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:46:40] <icinga-wm>	 RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:47:22] <icinga-wm>	 RECOVERY - Check systemd state on maps1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:47:46] <icinga-wm>	 RECOVERY - Check systemd state on maps1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:48:36] <icinga-wm>	 RECOVERY - Check systemd state on maps1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:48:36] <icinga-wm>	 RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:49:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T310011)', diff saved to https://phabricator.wikimedia.org/P29801 and previous config saved to /var/cache/conftool/dbconfig/20220615-114950-marostegui.json
[11:49:52] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[11:49:54] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[11:49:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:49:55] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[11:49:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:50:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29802 and previous config saved to /var/cache/conftool/dbconfig/20220615-115127-root.json
[11:51:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29803 and previous config saved to /var/cache/conftool/dbconfig/20220615-115135-root.json
[11:51:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:10] <icinga-wm>	 RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:52:10] <icinga-wm>	 RECOVERY - Check systemd state on maps2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:52:28] <icinga-wm>	 RECOVERY - Check systemd state on maps1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:54:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29804 and previous config saved to /var/cache/conftool/dbconfig/20220615-115452-root.json
[11:54:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:41] <wikibugs>	 (PS2) Jbond: redfish: update poll task to deal with older models [software/spicerack] - https://gerrit.wikimedia.org/r/805782
[11:57:49] <wikibugs>	 (PS3) Jbond: redfish: update poll task to deal with older models [software/spicerack] - https://gerrit.wikimedia.org/r/805782
[11:59:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P29805 and previous config saved to /var/cache/conftool/dbconfig/20220615-115935-marostegui.json
[11:59:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:09] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/805785 (https://phabricator.wikimedia.org/T310227) (owner: Slyngshede)
[12:00:41] <wikibugs>	 (CR) Slyngshede: [C: +2] Grant access to analytics_privatedata_users to user ricby [puppet] - https://gerrit.wikimedia.org/r/805785 (https://phabricator.wikimedia.org/T310227) (owner: Slyngshede)
[12:00:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5001.eqsin.wmnet
[12:00:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:11] <wikibugs>	 SRE, LDAP-Access-Requests, Product-Analytics, Patch-For-Review: Requesting access to Superset for Ricardo Baeza-Yates - https://phabricator.wikimedia.org/T310227 (SLyngshede-WMF) p:Triage→High @Leila I've added Ricardo to the analytics-privatedata-users users group, and the NDA group in L...
[12:06:55] <wikibugs>	 (PS6) Majavah: sonofgridengine: grid_configurator: filter 'normal' stderr output [puppet] - https://gerrit.wikimedia.org/r/801770 (https://phabricator.wikimedia.org/T309525)
[12:06:57] <wikibugs>	 (PS3) Majavah: sonofgridengine: grid_configurator: remove hostgroup and queue entries [puppet] - https://gerrit.wikimedia.org/r/801774 (https://phabricator.wikimedia.org/T309525)
[12:06:59] <wikibugs>	 (PS3) Majavah: sonofgridengine: grid_configurator: remove hosts entries [puppet] - https://gerrit.wikimedia.org/r/801777 (https://phabricator.wikimedia.org/T309525)
[12:07:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5001.eqsin.wmnet
[12:07:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:52] * kart_ updating cxserver; no major changes.
[12:10:31] <wikibugs>	 (CR) KartikMistry: [C: +2] Update cxserver to 2022-06-15-074244-production [deployment-charts] - https://gerrit.wikimedia.org/r/805726 (https://phabricator.wikimedia.org/T309266) (owner: KartikMistry)
[12:14:12] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2022-06-15-074244-production [deployment-charts] - https://gerrit.wikimedia.org/r/805726 (https://phabricator.wikimedia.org/T309266) (owner: KartikMistry)
[12:14:14] <wikibugs>	 (PS4) Slyngshede: Ganeti Prometheus exporter, initial checkin [debs/prometheus-ganeti-exporter] - https://gerrit.wikimedia.org/r/804276
[12:14:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T302659)', diff saved to https://phabricator.wikimedia.org/P29806 and previous config saved to /var/cache/conftool/dbconfig/20220615-121440-marostegui.json
[12:14:42] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[12:14:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:43] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[12:14:44] <stashbot>	 T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659
[12:14:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:14] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[12:16:16] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[12:16:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T310011)', diff saved to https://phabricator.wikimedia.org/P29807 and previous config saved to /var/cache/conftool/dbconfig/20220615-121620-marostegui.json
[12:16:22] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[12:16:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:24] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[12:16:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:30] <wikibugs>	 (CR) Joal: [C: +1] "LGTM - Could either ottomata or btullis merge this please?" [puppet] - https://gerrit.wikimedia.org/r/802598 (https://phabricator.wikimedia.org/T309806) (owner: Milimetric)
[12:16:32] <TheresNoTime>	 Amir1: apergos: hihi, will be at the UTC early training tomorrow - I have nothing to deploy though :)
[12:16:57] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[12:16:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:32] <apergos>	 ok, thanks for the heads up!
[12:18:34] <wikibugs>	 (CR) Btullis: [C: +2] Split up the tables we sqoop [puppet] - https://gerrit.wikimedia.org/r/802598 (https://phabricator.wikimedia.org/T309806) (owner: Milimetric)
[12:19:17] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[12:19:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:40] <wikibugs>	 (PS8) JMeybohm: Make SREBatchBase operate on host groups [cookbooks] - https://gerrit.wikimedia.org/r/802811
[12:19:42] <wikibugs>	 (PS22) JMeybohm: Add a cookbook for rolling reboot of k8s clusters [cookbooks] - https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661)
[12:19:59] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[12:20:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:41] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[12:21:02] <wikibugs>	 (CR) JMeybohm: Make SREBatchBase operate on host groups (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/802811 (owner: JMeybohm)
[12:21:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1032 es1033 es1034 for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29808 and previous config saved to /var/cache/conftool/dbconfig/20220615-122123-root.json
[12:21:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:36] <wikibugs>	 (CR) CI reject: [V: -1] Make SREBatchBase operate on host groups [cookbooks] - https://gerrit.wikimedia.org/r/802811 (owner: JMeybohm)
[12:23:06] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[12:23:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:11] <wikibugs>	 (CR) CI reject: [V: -1] Add a cookbook for rolling reboot of k8s clusters [cookbooks] - https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: JMeybohm)
[12:23:24] <Lucas_WMDE>	 TheresNoTime: I could offer https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/628773 as a no-op dead code cleanup if you want something to deploy :P
[12:23:33] <Lucas_WMDE>	 (the rest of the chain is blocked, but the first change should be okay)
[12:23:50] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[12:23:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:13] <TheresNoTime>	 Lucas_WMDE: works for me, thank you! but I'll defer to those doing the training ^^'
[12:24:19] <Lucas_WMDE>	 ^^
[12:25:39] <kart_>	 !log Updated cxserver to 2022-06-15-074244-production (T309266, T310116, T309384, T306963)
[12:25:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:48] <stashbot>	 T310116: Enable Section Translation in Uzbek Wikipedia - https://phabricator.wikimedia.org/T310116
[12:25:48] <stashbot>	 T309384: Enable Content and Section translation on wikipedias with new MT support from Flores - https://phabricator.wikimedia.org/T309384
[12:25:48] <stashbot>	 T306963: Integrate new section mapping database - https://phabricator.wikimedia.org/T306963
[12:25:49] <stashbot>	 T309266: Adjust default MT services for pairs where the default is not the most used - https://phabricator.wikimedia.org/T309266
[12:26:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5002.eqsin.wmnet
[12:26:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:24] <wikibugs>	 (PS9) JMeybohm: Make SREBatchBase operate on host groups [cookbooks] - https://gerrit.wikimedia.org/r/802811
[12:30:26] <wikibugs>	 (PS23) JMeybohm: Add a cookbook for rolling reboot of k8s clusters [cookbooks] - https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661)
[12:34:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5002.eqsin.wmnet
[12:34:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:33] <XioNoX>	 Hello! hello! we're going to start the Netbox upgrade now, please refrain from using it either directly or via cookbook (makevm, decom, provision, etc) if you have a doubt feel free to ask
[12:38:03] <godog>	 godspeed XioNoX and others involved
[12:39:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29810 and previous config saved to /var/cache/conftool/dbconfig/20220615-123938-root.json
[12:39:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29811 and previous config saved to /var/cache/conftool/dbconfig/20220615-123943-root.json
[12:39:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29812 and previous config saved to /var/cache/conftool/dbconfig/20220615-123949-root.json
[12:39:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:16] <wikibugs>	 (PS1) Jbond: SREBaseClass: Allow overriding actions [cookbooks] - https://gerrit.wikimedia.org/r/805807
[12:42:31] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on netbox:443 with reason: Netbox upgrade to 3.2 T296452
[12:42:32] <logmsgbot>	 !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 6:00:00 on netbox:443 with reason: Netbox upgrade to 3.2 T296452
[12:42:34] <moritzm>	 !log failover ganeti master in eqsin to ganeti5001
[12:42:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:37] <stashbot>	 T296452: Upgrade Netbox to 3.2 - https://phabricator.wikimedia.org/T296452
[12:42:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:49] <wikibugs>	 (CR) Slyngshede: [C: +1] Retire profile::logster_alarm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/805733 (owner: Muehlenhoff)
[12:47:00] <wikibugs>	 (PS24) JMeybohm: Add a cookbook for rolling reboot of k8s clusters [cookbooks] - https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661)
[12:47:12] <wikibugs>	 (CR) JMeybohm: Add a cookbook for rolling reboot of k8s clusters (7 comments) [cookbooks] - https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: JMeybohm)
[12:48:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T310011)', diff saved to https://phabricator.wikimedia.org/P29813 and previous config saved to /var/cache/conftool/dbconfig/20220615-124810-marostegui.json
[12:48:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:48:15] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[12:48:30] <logmsgbot>	 !log jbond@deploy1002 Started deploy [netbox/deploy@7bbf659]: log
[12:48:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:00] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti5003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[12:51:42] <logmsgbot>	 !log jbond@deploy1002 Finished deploy [netbox/deploy@7bbf659]: log (duration: 03m 12s)
[12:51:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:04] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_eqiad_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:54:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29815 and previous config saved to /var/cache/conftool/dbconfig/20220615-125442-root.json
[12:54:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29816 and previous config saved to /var/cache/conftool/dbconfig/20220615-125447-root.json
[12:54:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29817 and previous config saved to /var/cache/conftool/dbconfig/20220615-125452-root.json
[12:54:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:16] <logmsgbot>	 !log ayounsi@deploy1002 Started deploy [netbox/deploy@7bbf659]: deploying v2.11.12
[12:55:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:21] <logmsgbot>	 !log ayounsi@deploy1002 Finished deploy [netbox/deploy@7bbf659]: deploying v2.11.12 (duration: 00m 05s)
[12:55:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:24] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_sync.service,netbox_ganeti_drmrs01_sync.service,netbox_ganeti_eqiad_sync.service,netbox_ganeti_eqsin_sync.service,netbox_ganeti_ulsfo_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:55:47] <logmsgbot>	 !log ayounsi@deploy1002 Started deploy [netbox/deploy@7bbf659]: deploying v2.11.12
[12:55:48] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_ulsfo_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_ulsfo_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:55:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:56:12] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM, feel free to ignore the nits" [puppet] - https://gerrit.wikimedia.org/r/801770 (https://phabricator.wikimedia.org/T309525) (owner: Majavah)
[12:56:45] <logmsgbot>	 !log ayounsi@deploy1002 Finished deploy [netbox/deploy@7bbf659]: deploying v2.11.12 (duration: 00m 58s)
[12:56:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:03] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/801774 (https://phabricator.wikimedia.org/T309525) (owner: Majavah)
[12:57:12] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:57:39] <wikibugs>	 (CR) Ayounsi: [V: +1 C: +2] Netbox: Add 2.11 configuration knobs [puppet] - https://gerrit.wikimedia.org/r/790400 (https://phabricator.wikimedia.org/T296452) (owner: Ayounsi)
[12:57:51] <wikibugs>	 (PS2) Ayounsi: Netbox: Add 2.11 configuration knobs [puppet] - https://gerrit.wikimedia.org/r/790400 (https://phabricator.wikimedia.org/T296452)
[12:58:42] <wikibugs>	 SRE, Infrastructure-Foundations, Mail, Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (MoritzMuehlenhoff) I also removed logsteralarms@ earlier the day, it's no longer needed.
[12:59:33] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM, feel free to ignore the nits" [puppet] - https://gerrit.wikimedia.org/r/801777 (https://phabricator.wikimedia.org/T309525) (owner: Majavah)
[13:00:02] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_codfw_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220615T1300).
[13:00:05] <jouncebot>	 awight and mainframe98: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:09] <Lucas_WMDE>	 o/
[13:00:12] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on netbox1002.eqiad.wmnet with reason: Netbox upgrade to 3.2
[13:00:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:14] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on netbox1002.eqiad.wmnet with reason: Netbox upgrade to 3.2
[13:00:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:17] <urbanecm>	 o/
[13:00:35] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on netbox2002.codfw.wmnet with reason: Netbox upgrade to 3.2
[13:00:37] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on netbox2002.codfw.wmnet with reason: Netbox upgrade to 3.2
[13:00:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:41] <urbanecm>	 Lucas_WMDE: you were here earlier, do you plan to deploy?
[13:00:42] <awight>	 I can deploy my patch, and happy to take care of mainframe98's if you wish?
[13:00:56] <Lucas_WMDE>	 awight: feel free to self-service
[13:00:57] <wikibugs>	 (CR) Urbanecm: [C: +1] "LGTM. I think we can deploy this at any time." [mediawiki-config] - https://gerrit.wikimedia.org/r/805490 (https://phabricator.wikimedia.org/T300532) (owner: Sergio Gimeno)
[13:01:04] <Lucas_WMDE>	 and if you also want to deploy mainframe98’s patch that’s fine by me :)
[13:01:12] <mainframe98>	 awight: I'd appreciate that, thank you
[13:01:17] <awight>	 great!
[13:01:39] * mainframe98 goes to scrounge up a test case on mediawiki.org to use as test
[13:02:21] <logmsgbot>	 !log ayounsi@deploy1002 Started deploy [netbox/deploy@7bbf659]: deploying v3.1
[13:02:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:45] <wikibugs>	 (CR) Awight: [C: +2] "Backport deployment." [extensions/VisualEditor] (wmf/1.39.0-wmf.16) - https://gerrit.wikimedia.org/r/805745 (https://phabricator.wikimedia.org/T310602) (owner: Awight)
[13:03:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P29818 and previous config saved to /var/cache/conftool/dbconfig/20220615-130315-marostegui.json
[13:03:18] <wikibugs>	 (CR) Urbanecm: [C: -1] "actually, can you also add it to production's IS.php (with default => false)? it should work as-is, but adding to IS.php is a good practic" [mediawiki-config] - https://gerrit.wikimedia.org/r/805490 (https://phabricator.wikimedia.org/T300532) (owner: Sergio Gimeno)
[13:03:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:03] <logmsgbot>	 !log ayounsi@deploy1002 Finished deploy [netbox/deploy@7bbf659]: deploying v3.1 (duration: 01m 43s)
[13:04:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:04:47] <wikibugs>	 (PS4) Jbond: Netbox: update config file for 3.1 [puppet] - https://gerrit.wikimedia.org/r/790681 (https://phabricator.wikimedia.org/T296452) (owner: Ayounsi)
[13:05:41] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:05:57] <wikibugs>	 Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: systemd.timer not executing on cumin2001 after command was modified - https://phabricator.wikimedia.org/T268974 (SLyngshede-WMF) Open→Resolved a:SLyngshede-WMF
[13:07:08] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_ulsfo_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_ulsfo_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:08:50] <logmsgbot>	 !log ayounsi@deploy1002 Started deploy [netbox/deploy@7bbf659]: deploying v3.1
[13:08:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29819 and previous config saved to /var/cache/conftool/dbconfig/20220615-130946-root.json
[13:09:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29820 and previous config saved to /var/cache/conftool/dbconfig/20220615-130951-root.json
[13:09:53] <logmsgbot>	 !log ayounsi@deploy1002 Finished deploy [netbox/deploy@7bbf659]: deploying v3.1 (duration: 01m 03s)
[13:09:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29821 and previous config saved to /var/cache/conftool/dbconfig/20220615-130956-root.json
[13:09:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:42] <wikibugs>	 (PS7) Majavah: sonofgridengine: grid_configurator: filter 'normal' stderr output [puppet] - https://gerrit.wikimedia.org/r/801770 (https://phabricator.wikimedia.org/T309525)
[13:10:44] <wikibugs>	 (PS4) Majavah: sonofgridengine: grid_configurator: remove hostgroup and queue entries [puppet] - https://gerrit.wikimedia.org/r/801774 (https://phabricator.wikimedia.org/T309525)
[13:10:46] <wikibugs>	 (PS4) Majavah: sonofgridengine: grid_configurator: remove hosts entries [puppet] - https://gerrit.wikimedia.org/r/801777 (https://phabricator.wikimedia.org/T309525)
[13:10:59] <wikibugs>	 (CR) Ayounsi: [C: +2] Netbox: update config file for 3.1 [puppet] - https://gerrit.wikimedia.org/r/790681 (https://phabricator.wikimedia.org/T296452) (owner: Ayounsi)
[13:11:05] <wikibugs>	 (CR) Majavah: sonofgridengine: grid_configurator: filter 'normal' stderr output (4 comments) [puppet] - https://gerrit.wikimedia.org/r/801770 (https://phabricator.wikimedia.org/T309525) (owner: Majavah)
[13:11:07] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (Cmjohnson) @MatthewVernon  We do not have the licenses for Integrated Remote Console.  As for a USB key attached, there is not anything, I turned off the internal USB, can you try agai...
[13:11:21] <wikibugs>	 (CR) Majavah: sonofgridengine: grid_configurator: remove hosts entries (1 comment) [puppet] - https://gerrit.wikimedia.org/r/801777 (https://phabricator.wikimedia.org/T309525) (owner: Majavah)
[13:11:22] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_codfw_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:12:35] <wikibugs>	 (CR) David Caro: [C: +2] sonofgridengine: grid_configurator: remove hosts entries [puppet] - https://gerrit.wikimedia.org/r/801777 (https://phabricator.wikimedia.org/T309525) (owner: Majavah)
[13:12:48] <wikibugs>	 (CR) David Caro: [C: +2] sonofgridengine: grid_configurator: remove hostgroup and queue entries [puppet] - https://gerrit.wikimedia.org/r/801774 (https://phabricator.wikimedia.org/T309525) (owner: Majavah)
[13:13:16] <wikibugs>	 (PS2) Volans: netbox: 3.1 -> 3.2 add migration script [software/netbox-extras] - https://gerrit.wikimedia.org/r/790636 (owner: Jbond)
[13:13:26] <wikibugs>	 (CR) David Caro: [C: +2] sonofgridengine: grid_configurator: filter 'normal' stderr output [puppet] - https://gerrit.wikimedia.org/r/801770 (https://phabricator.wikimedia.org/T309525) (owner: Majavah)
[13:14:13] <wikibugs>	 (CR) CI reject: [V: -1] netbox: 3.1 -> 3.2 add migration script [software/netbox-extras] - https://gerrit.wikimedia.org/r/790636 (owner: Jbond)
[13:15:40] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_eqiad_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:16:14] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] netbox: 3.1 -> 3.2 add migration script [software/netbox-extras] - https://gerrit.wikimedia.org/r/790636 (owner: Jbond)
[13:16:33] <wikibugs>	 (PS6) Majavah: P:(toolforge|wmcs::paws)::prometheus: configure alertmanager endpoint [puppet] - https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716)
[13:18:06] <wikibugs>	 (CR) Jbond: [C: +2] netbox: Add fixes for netbox 3.1 [software/netbox-extras] - https://gerrit.wikimedia.org/r/790705 (owner: Jbond)
[13:18:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P29822 and previous config saved to /var/cache/conftool/dbconfig/20220615-131820-marostegui.json
[13:18:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:48] <wikibugs>	 (CR) CI reject: [V: -1] netbox: Add fixes for netbox 3.1 [software/netbox-extras] - https://gerrit.wikimedia.org/r/790705 (owner: Jbond)
[13:19:36] <wikibugs>	 (CR) CI reject: [V: -1] P:(toolforge|wmcs::paws)::prometheus: configure alertmanager endpoint [puppet] - https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716) (owner: Majavah)
[13:19:40] <wikibugs>	 (PS8) Jbond: netbox: Add fixes for netbox 3.1 [software/netbox-extras] - https://gerrit.wikimedia.org/r/790705
[13:19:48] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:20:21] <wikibugs>	 (CR) CI reject: [V: -1] netbox: Add fixes for netbox 3.1 [software/netbox-extras] - https://gerrit.wikimedia.org/r/790705 (owner: Jbond)
[13:20:23] <wikibugs>	 (PS7) Majavah: P:(toolforge|wmcs::paws)::prometheus: configure alertmanager endpoint [puppet] - https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716)
[13:20:48] <awight>	 mainframe98: I'm still holding for CI, but wanted to ask you if your Translate patch is something that can be tested on test.wikipedia.org once it's merged?
[13:21:07] <wikibugs>	 (PS8) Majavah: P:(toolforge|wmcs::paws)::prometheus: configure alertmanager endpoint [puppet] - https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716)
[13:21:22] <wikibugs>	 (CR) Jbond: [V: +2] netbox: Add fixes for netbox 3.1 [software/netbox-extras] - https://gerrit.wikimedia.org/r/790705 (owner: Jbond)
[13:21:24] <wikibugs>	 (Merged) jenkins-bot: Restore internal mechanism to use either back or close button [extensions/VisualEditor] (wmf/1.39.0-wmf.16) - https://gerrit.wikimedia.org/r/805745 (https://phabricator.wikimedia.org/T310602) (owner: Awight)
[13:21:24] <awight>	 mainframe98: I think this is as far as the train has proceeded for wmf.16
[13:21:37] <mainframe98>	 awight: yes, but I don't have the required permissions. Is mediawiki.org an option?
[13:22:05] <mainframe98>	 Also, I'm not sure if that wiki has a translatable page that would error
[13:22:09] <awight>	 mainframe98: Unfortunately not, it's at wmf.15: https://www.mediawiki.org/wiki/Special:Version
[13:22:10] <jinxer-wm>	 (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[13:22:19] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Recycling Pickup for EQIAD - https://phabricator.wikimedia.org/T307140 (Cmjohnson) Per my discussion with @wiki_willy we are keeping all juniper gear for future donations.
[13:22:33] <awight>	 I'm okay with deploying blindly, if you can track versions and test once that becomes possible?
[13:22:52] <awight>	 (This would be the same as if your patch had been merged before the wmf.16 branch.)
[13:22:59] <wikibugs>	 (CR) Majavah: P:(toolforge|wmcs::paws)::prometheus: configure alertmanager endpoint (2 comments) [puppet] - https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716) (owner: Majavah)
[13:23:00] <mainframe98>	 Sure, I use that feature daily to fight vandalism; if it breaks, I'll know
[13:23:05] <awight>	 :-) ty!
[13:24:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29823 and previous config saved to /var/cache/conftool/dbconfig/20220615-132450-root.json
[13:24:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29824 and previous config saved to /var/cache/conftool/dbconfig/20220615-132454-root.json
[13:24:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29825 and previous config saved to /var/cache/conftool/dbconfig/20220615-132500-root.json
[13:25:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:13] <wikibugs>	 (PS2) Volans: ganeti-netbox-sync: Add netbox 3.2 support [software/netbox-extras] - https://gerrit.wikimedia.org/r/790991 (https://phabricator.wikimedia.org/T296452) (owner: Ayounsi)
[13:25:51] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35862/console" [puppet] - https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716) (owner: Majavah)
[13:26:15] <wikibugs>	 (CR) CI reject: [V: -1] ganeti-netbox-sync: Add netbox 3.2 support [software/netbox-extras] - https://gerrit.wikimedia.org/r/790991 (https://phabricator.wikimedia.org/T296452) (owner: Ayounsi)
[13:26:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:26:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:52] <awight>	 Ooof scap crashed with "sync-file failed: <AttributeError> 'Namespace' object has no attribute 'pause_after_testserver_sync'"
[13:27:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:27:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:27:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:43] <logmsgbot>	 !log ayounsi@deploy1002 Started deploy [netbox/deploy@7bbf659]: deploying v3.2
[13:27:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:27:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:46] <awight>	 This was mentioned at https://phabricator.wikimedia.org/P29785 but there's no task yet?
[13:28:55] <awight>	 hnowlan: ^ did you find a workaround?
[13:28:59] <wikibugs>	 (PS1) Volans: Removed temporary migration script to 3.2 [software/netbox-extras] - https://gerrit.wikimedia.org/r/805813 (https://phabricator.wikimedia.org/T296452)
[13:29:49] <logmsgbot>	 !log ayounsi@deploy1002 Finished deploy [netbox/deploy@7bbf659]: deploying v3.2 (duration: 02m 06s)
[13:29:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:02] <logmsgbot>	 !log ayounsi@deploy1002 Started deploy [netbox/deploy@7bbf659]: deploying v3.2
[13:30:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:46] <wikibugs>	 (CR) Eevans: [C: +1] Add new Cassandra cluster for ML cache/feature-store workloads in eqiad [puppet] - https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232) (owner: Elukey)
[13:30:50] <wikibugs>	 (CR) Volans: [C: +2] Removed temporary migration script to 3.2 [software/netbox-extras] - https://gerrit.wikimedia.org/r/805813 (https://phabricator.wikimedia.org/T296452) (owner: Volans)
[13:31:11] <logmsgbot>	 !log ayounsi@deploy1002 Finished deploy [netbox/deploy@7bbf659]: deploying v3.2 (duration: 01m 08s)
[13:31:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:54] <wikibugs>	 (Merged) jenkins-bot: Removed temporary migration script to 3.2 [software/netbox-extras] - https://gerrit.wikimedia.org/r/805813 (https://phabricator.wikimedia.org/T296452) (owner: Volans)
[13:32:09] <wikibugs>	 (PS3) Volans: ganeti-netbox-sync: Add netbox 3.2 support [software/netbox-extras] - https://gerrit.wikimedia.org/r/790991 (https://phabricator.wikimedia.org/T296452) (owner: Ayounsi)
[13:33:23] <wikibugs>	 (CR) Volans: [C: +2] ganeti-netbox-sync: Add netbox 3.2 support [software/netbox-extras] - https://gerrit.wikimedia.org/r/790991 (https://phabricator.wikimedia.org/T296452) (owner: Ayounsi)
[13:33:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T310011)', diff saved to https://phabricator.wikimedia.org/P29826 and previous config saved to /var/cache/conftool/dbconfig/20220615-133326-marostegui.json
[13:33:28] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[13:33:29] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[13:33:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:31] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[13:33:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T310011)', diff saved to https://phabricator.wikimedia.org/P29827 and previous config saved to /var/cache/conftool/dbconfig/20220615-133334-marostegui.json
[13:33:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:03] <wikibugs>	 (03Merged) 10jenkins-bot: ganeti-netbox-sync: Add netbox 3.2 support [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790991 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi)
[13:34:22] <wikibugs>	 (03PS13) 10Elukey: Add new Cassandra cluster for ML cache/feature-store workloads in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232)
[13:35:47] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add new Cassandra cluster for ML cache/feature-store workloads in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232) (owner: 10Elukey)
[13:36:11] <wikibugs>	 (03PS2) 10Sergio Gimeno: MentorDashboard: enable the Vue version of the dashboard in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805490 (https://phabricator.wikimedia.org/T300532)
[13:36:34] <wikibugs>	 (03PS1) 10Ottomata: eventstreams - expose mediawiki.revision-tags-change [deployment-charts] - 10https://gerrit.wikimedia.org/r/805814 (https://phabricator.wikimedia.org/T294391)
[13:36:53] <wikibugs>	 (03CR) 10Sergio Gimeno: MentorDashboard: enable the Vue version of the dashboard in beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805490 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno)
[13:38:16] <awight>	 FYI, I'm deploying with --force to work around a scap bug.
[13:38:17] <logmsgbot>	 !log awight@deploy1002 Synchronized php-1.39.0-wmf.16/extensions/VisualEditor/modules/ve-mw/ui/dialogs/ve.ui.MWTransclusionDialog.js: Backport: [[gerrit:805745|Restore internal mechanism to use either back or close button (T310602)]] (duration: 00m 37s)
[13:38:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:20] <stashbot>	 T310602: Unexpected back button behavior in VisualEditor's citation dialog - https://phabricator.wikimedia.org/T310602
[13:38:43] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] eventstreams - expose mediawiki.revision-tags-change [deployment-charts] - 10https://gerrit.wikimedia.org/r/805814 (https://phabricator.wikimedia.org/T294391) (owner: 10Ottomata)
[13:38:45] <wikibugs>	 (03CR) 10Awight: [C: 03+2] "Backport deployment." [extensions/Translate] (wmf/1.39.0-wmf.16) - 10https://gerrit.wikimedia.org/r/805749 (https://phabricator.wikimedia.org/T310440) (owner: 10Mainframe98)
[13:38:47] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventstreams - expose mediawiki.revision-tags-change [deployment-charts] - 10https://gerrit.wikimedia.org/r/805814 (https://phabricator.wikimedia.org/T294391) (owner: 10Ottomata)
[13:38:49] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[13:39:48] <wikibugs>	 (03PS2) 10Filippo Giunchedi: icinga: check commons.w.o with blackbox exporter [puppet] - 10https://gerrit.wikimedia.org/r/804274 (https://phabricator.wikimedia.org/T305847)
[13:39:50] <wikibugs>	 (03PS1) 10Filippo Giunchedi: WIP irc check via blackbox [puppet] - 10https://gerrit.wikimedia.org/r/805815
[13:39:51] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: use hostname for blackbox::check::http [puppet] - 10https://gerrit.wikimedia.org/r/805816 (https://phabricator.wikimedia.org/T305847)
[13:39:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29828 and previous config saved to /var/cache/conftool/dbconfig/20220615-133954-root.json
[13:39:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29829 and previous config saved to /var/cache/conftool/dbconfig/20220615-133958-root.json
[13:40:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29830 and previous config saved to /var/cache/conftool/dbconfig/20220615-134004-root.json
[13:40:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:10] <hnowlan>	 awight: no unfortunately, filing a bug now 
[13:41:08] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply
[13:41:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:11] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
[13:41:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:53] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-netbox.state on authdns2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[13:43:58] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (10MatthewVernon) @cmjohnson Ah, OK, I sort-of assumed we had HTML5 console everywhere. Now I know better :)  I've just tried turning the system on again, and it's still finding the myste...
[13:45:22] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply
[13:45:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:25] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
[13:45:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:36] <wikibugs>	 (03PS2) 10Ayounsi: wmf-netbox: Netbox 3.2 compatibility [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/790975 (https://phabricator.wikimedia.org/T296452)
[13:46:49] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/790975 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi)
[13:47:07] <wikibugs>	 (03CR) 10Ayounsi: [V: 03+2 C: 03+2] wmf-netbox: Netbox 3.2 compatibility [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/790975 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi)
[13:48:45] <wikibugs>	 (03PS1) 10Majavah: sonofgridengine: grid_configurator: fix parameter name [puppet] - 10https://gerrit.wikimedia.org/r/805818
[13:49:25] <logmsgbot>	 !log ayounsi@cumin2002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: deploy new homer wmf-netbox - ayounsi@cumin2002
[13:49:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:02] <logmsgbot>	 !log ayounsi@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: deploy new homer wmf-netbox - ayounsi@cumin2002
[13:51:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:00] <logmsgbot>	 !log ayounsi@cumin2002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: deploy new homer wmf-netbox - ayounsi@cumin2002
[13:53:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:37] <logmsgbot>	 !log ayounsi@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: deploy new homer wmf-netbox - ayounsi@cumin2002
[13:54:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29831 and previous config saved to /var/cache/conftool/dbconfig/20220615-135458-root.json
[13:55:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29832 and previous config saved to /var/cache/conftool/dbconfig/20220615-135502-root.json
[13:55:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29833 and previous config saved to /var/cache/conftool/dbconfig/20220615-135508-root.json
[13:55:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:25] <wikibugs>	 (03PS1) 10Slyngshede: systemd::timer::job cleanup now absent cronjobs. [puppet] - 10https://gerrit.wikimedia.org/r/805820 (https://phabricator.wikimedia.org/T273673)
[13:55:53] <wikibugs>	 (03Merged) 10jenkins-bot: Fix deletion of translation pages outside of NS_MAIN namespace [extensions/Translate] (wmf/1.39.0-wmf.16) - 10https://gerrit.wikimedia.org/r/805749 (https://phabricator.wikimedia.org/T310440) (owner: 10Mainframe98)
[13:56:11] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.130.9:9042 on ml-cache1001 is CRITICAL: connect to address 10.64.130.9 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[13:57:31] <logmsgbot>	 !log awight@deploy1002 Synchronized php-1.39.0-wmf.16/extensions/Translate/src/PageTranslation/DeleteTranslatableBundleSpecialPage.php: Backport: [[gerrit:805749|Fix deletion of translation pages outside of NS_MAIN namespace (T310440)]] (duration: 00m 32s)
[13:57:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:36] <stashbot>	 T310440: Attempting to delete a translation page using Special:PageTranslationDeletePage shows an internal error - https://phabricator.wikimedia.org/T310440
[13:57:43] <awight>	 mainframe98: Deployed, thanks for the patch!
[13:58:08] <awight>	 !log EU afternoon backport window complete.
[13:58:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:11] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[13:58:31] <mainframe98>	 awight: Thank you. I'll follow up with testing after the train progresses
[13:58:41] <awight>	 Great!
[13:59:45] <wikibugs>	 10SRE, 10Traffic: Let's Encrypt issuance chains update - https://phabricator.wikimedia.org/T283164 (10SLyngshede-WMF)
[13:59:52] <wikibugs>	 (03PS1) 10Volans: reports: puppetdb fix removed method clone() [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/805822
[14:00:19] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: OpenSSL < 1.1.0 compatibility issues with new LE issuance chain - https://phabricator.wikimedia.org/T283165 (10SLyngshede-WMF) 05Open→03Resolved a:03SLyngshede-WMF
[14:00:46] <wikibugs>	 (03PS1) 10Majavah: P:toolforge::redis_sentinel: stop sentinel and puppet fighting [puppet] - 10https://gerrit.wikimedia.org/r/805823 (https://phabricator.wikimedia.org/T309014)
[14:01:09] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/805822 (owner: 10Volans)
[14:01:13] <wikibugs>	 (03CR) 10Volans: [C: 03+2] reports: puppetdb fix removed method clone() [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/805822 (owner: 10Volans)
[14:01:38] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin1001"
[14:01:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:44] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "sync data - jbond@cumin1001"
[14:01:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:11] <wikibugs>	 (03Merged) 10jenkins-bot: reports: puppetdb fix removed method clone() [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/805822 (owner: 10Volans)
[14:02:28] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35863/console" [puppet] - 10https://gerrit.wikimedia.org/r/805823 (https://phabricator.wikimedia.org/T309014) (owner: 10Majavah)
[14:03:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:03:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:43] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:03:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:03:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T310011)', diff saved to https://phabricator.wikimedia.org/P29834 and previous config saved to /var/cache/conftool/dbconfig/20220615-140505-marostegui.json
[14:05:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:10] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[14:05:57] <andre>	 [URL redirect patch] Would someone please review/merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/791321 ? Thanks in advance!
[14:06:01] <wikibugs>	 (03PS2) 10Slyngshede: systemd::timer::job cleanup now absent cronjobs. [puppet] - 10https://gerrit.wikimedia.org/r/805820 (https://phabricator.wikimedia.org/T273673)
[14:07:05] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/804572 (owner: 10Jbond)
[14:07:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:07:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:22] <wikibugs>	 (03PS3) 10Jbond: scap: update venv to use the system ca bundle [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/804572
[14:07:27] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/805820 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[14:07:33] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] scap: update venv to use the system ca bundle [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/804572 (owner: 10Jbond)
[14:07:35] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (10Cmjohnson) @MatthewVernon I disabled the internal SD drive and internal USB. I am hoping that works, I do not want to disable external USBs or I cannot use them on-site.  Can you try n...
[14:07:43] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:08:12] <logmsgbot>	 !log jnuche@deploy1002 Installing scap version "4.9.4" for 558 hosts
[14:08:13] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5003.eqsin.wmnet
[14:08:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:25] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin1001"
[14:08:26] <wikibugs>	 (03PS1) 10Btullis: Update the version of the datahub containers that is deployed [deployment-charts] - 10https://gerrit.wikimedia.org/r/805826 (https://phabricator.wikimedia.org/T310079)
[14:08:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:31] <logmsgbot>	 !log jnuche@deploy1002 Installation of scap version "4.9.4" completed for 558 hosts
[14:08:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:12] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "sync data - jbond@cumin1001"
[14:09:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:53] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin1001"
[14:09:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:41] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "sync data - jbond@cumin1001"
[14:10:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:10] <jinxer-wm>	 (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[14:12:10] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:15:13] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache1001.eqiad.wmnet with OS buster
[14:15:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:38] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin1001"
[14:15:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:25] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "sync data - jbond@cumin1001"
[14:16:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5003.eqsin.wmnet
[14:17:10] <jinxer-wm>	 (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[14:17:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:57] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin1001"
[14:19:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:43] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[14:19:44] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "sync data - jbond@cumin1001"
[14:19:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P29836 and previous config saved to /var/cache/conftool/dbconfig/20220615-142010-marostegui.json
[14:20:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:57] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin1001"
[14:22:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:19] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:22:44] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "sync data - jbond@cumin1001"
[14:22:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:56] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:23:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:53] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:25:25] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:27:24] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache1001.eqiad.wmnet with reason: host reimage
[14:27:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:10] <logmsgbot>	 !log hnowlan@deploy1002 Synchronized private/PrivateSettings.php: T308670 credentials to access the similar-users service (duration: 03m 32s)
[14:30:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:13] <stashbot>	 T308670: Configure SimilarEditors in production with Similarusers credentials - https://phabricator.wikimedia.org/T308670
[14:30:31] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-cache1001.eqiad.wmnet with reason: host reimage
[14:30:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:00] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[14:31:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:48] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:34:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P29838 and previous config saved to /var/cache/conftool/dbconfig/20220615-143515-marostegui.json
[14:35:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:58] <andre>	 [URL redirect patch] Would someone please review/merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/791321 ? Thanks in advance!
[14:36:09] <andre>	 (^ @dcaro maybe? :)
[14:38:48] <andre>	 hashar: Do you maybe know who could +2 https://gerrit.wikimedia.org/r/c/integration/docroot/+/791111 ? TIA :)
[14:41:05] <dcaro>	 looking
[14:42:46] <dcaro>	 andre: merged :)
[14:43:20] <dcaro>	 andre: is there anything else needed for it to take effect? (aside from waiting for puppet to run)
[14:43:37] <andre>	 dcaro: I hope not. :P Thank you a lot!
[14:43:42] <dcaro>	 👍
[14:46:00] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin1001"
[14:46:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:30] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[14:46:47] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "sync data - jbond@cumin1001"
[14:46:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:37] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[14:47:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:06] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin1001"
[14:49:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:34] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sync data - jbond@cumin1001"
[14:49:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T310011)', diff saved to https://phabricator.wikimedia.org/P29839 and previous config saved to /var/cache/conftool/dbconfig/20220615-145020-marostegui.json
[14:50:22] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[14:50:24] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[14:50:25] <urandom>	 !log ALTER-ing replication for codfw (Cassandra) expansion -- T307641
[14:50:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T310011)', diff saved to https://phabricator.wikimedia.org/P29840 and previous config saved to /var/cache/conftool/dbconfig/20220615-145028-marostegui.json
[14:50:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:30] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[14:50:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:39] <stashbot>	 T307641: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641
[14:50:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:05] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.pdus.uptime
[14:51:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:14] <logmsgbot>	 !log jbond@cumin1001 END (ERROR) - Cookbook sre.pdus.uptime (exit_code=97)
[14:52:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:28] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.pdus.rotate-password
[14:53:28] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.pdus.rotate-password (exit_code=99)
[14:53:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:36] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.pdus.rotate-password
[14:53:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:44] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.pdus.rotate-password (exit_code=0)
[14:53:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:57] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.pdus.rotate-password
[14:54:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:07] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.pdus.rotate-password (exit_code=0)
[14:54:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6003.drmrs.wmnet
[14:55:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:21] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1059.eqiad.wmnet with reason: host reimage
[14:57:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:58] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[14:58:09] <XioNoX>	 jbond, volans and I are happy to announce that Netbox got successfully upgraded. You can now resume using it, as well as cookbooks.
[14:58:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:55] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset
[14:58:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:18] <logmsgbot>	 !log jbond@cumin1001 Updating IPMI password on 1 hosts - jbond@cumin1001
[14:59:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:21] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0)
[14:59:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:04] <jouncebot>	 brennen, thcipriani, and mutante: That opportune time is upon us again. Time for a Phabricator update deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220615T1500).
[15:00:29] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1059.eqiad.wmnet with reason: host reimage
[15:00:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:38] <thcipriani>	 o/
[15:00:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6003.drmrs.wmnet
[15:00:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:34] <jynus>	 let me know before any write happens so I can stop 2 of the replicas for phabricator, as an extra rollback protection, thcipriani
[15:01:38] <brennen>	 o/
[15:02:04] <thcipriani>	 jynus: will do! Thank you!
[15:02:28] <jynus>	 that way, should the worst thing happen, we have immediate rollback, as if nothing had happened
[15:02:42] <icinga-wm>	 PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:03:09] <mutante>	 here :)
[15:03:24] <mutante>	 !log phabricator maintenance about to start
[15:03:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:05] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on phab1001.eqiad.wmnet with reason: maintenance
[15:05:07] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab1001.eqiad.wmnet with reason: maintenance
[15:05:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:05:26] <jynus>	 is there a ticket (I know it won't be of much use :-)
[15:05:29] <jynus>	 ?
[15:06:57] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply
[15:07:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:30] <mutante>	 silence submitted for phabricator in alertmanager
[15:08:37] <mutante>	 icinga downtime sent via cumin
[15:08:48] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti6001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[15:09:16] <volans>	 mutante: FYI the downtime cookbook also downtimes the host on alertmanager, for anything host-related
[15:11:06] <mutante>	 volans: icinga was phab1001.eqiad.wmnet, alertmanager was phabricator.wikimedia.org:443
[15:11:27] <mutante>	 it has 2 IPs. so not so sure about that
[15:11:57] <volans>	 ack, we need to improve the downtime cookbook to allow downtiming those too in AM, all the bits are already in spicerack
[15:12:01] <jynus>	 Can Not Connect to MySQL - should I stop things now?
[15:12:51] <mutante>	 jynus: not yet, we are not writing anything
[15:12:55] <jynus>	 ok
[15:15:09] <icinga-wm>	 PROBLEM - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 5255 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Phabricator
[15:15:48] <jynus>	 some downtime was missing, I guess
[15:15:56] * Emperor here
[15:16:10] <moritzm>	 maintenance, closing the incident
[15:16:25] <Emperor>	 typical I get paged when downstairs making tea.
[15:16:33] <Emperor>	 moritzm: thanks
[15:16:34] <jhathaway>	 thanks moritzm 
[15:16:55] <icinga-wm>	 RECOVERY - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 39622 bytes in 0.146 second response time https://wikitech.wikimedia.org/wiki/Phabricator
[15:17:11] <hashar>	 ^^ known, it is being upgraded
[15:17:24] <mutante>	 sorry, I tried to prevent exactly that
[15:17:26] <mutante>	 with the downtimes
[15:17:45] <jbond>	 fyi the page was generated from the new prometheus::blackbox::check::http (https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/phabricator/monitoring.pp#L50). It's possible the cookbook needs updating to take care of checks like this (cc godog volans)
[15:18:12] <jynus>	 actually interesting ^
[15:18:22] <jynus>	 should I file a ticket about this?
[15:18:24] <volans>	 ni
[15:18:26] <volans>	 no
[15:18:33] <volans>	 those are two separate things
[15:18:34] <jynus>	 ok
[15:18:43] <mutante>	 the icinga alert is because there is a virtual host in icinga for phabricator.wikimedia.org
[15:18:45] <volans>	 the downtime cookbook downtimes hosts as of now, a separate silence was added manually
[15:18:53] <mutante>	 and you can't send downtimes to virtual hosts, afaict
[15:19:00] <volans>	 mutante: yes you can
[15:19:03] <jbond>	 oh sorry, I missed that it was an icinga alert, ignore me
[15:19:24] <volans>	  --force               Override the check that use a Cumin query to validate the given hosts. Useful when you want to downtime a "host" that is not a real host like
[15:19:27] <volans>	                         a service or not anymore queryable via Cumin.
[15:20:18] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on phabricator.wikimedia.org with reason: maintenace
[15:20:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:20] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phabricator.wikimedia.org with reason: maintenace
[15:20:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:24] <mutante>	 thanks volans. done!
[15:20:25] <mutante>	 ==> Will downtime 1 unverified hosts: phabricator.wikimedia.org
[15:20:30] <mutante>	 with --force
[15:20:37] <logmsgbot>	 !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host ms-be1059.eqiad.wmnet with OS bullseye
[15:20:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:49] <mutante>	 jbond: i think both might be true
[15:21:18] <jbond>	 mutante: yes, it's possible the other one would have triggered eventually
[15:22:00] <jynus>	 still, the duality of the host vs service model, inherited from icinga, may cause confusion for some time, I predict
[15:23:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T310011)', diff saved to https://phabricator.wikimedia.org/P29841 and previous config saved to /var/cache/conftool/dbconfig/20220615-152315-marostegui.json
[15:23:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:20] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[15:24:05] <mutante>	 icinga has 2 hosts, phab1001.eqiad.wmnet with all the standard services on it AND phabricator.wikimedia.org as a virtual host 
[15:24:16] <mutante>	 then there is alertmanager that has phabricator.wikimedia.org
[15:24:26] <mutante>	 and then in addition to both of these..there was what created the actual page
[15:24:44] <jynus>	 more important, volans: https://www.youtube.com/watch?v=zIV4poUZAQo
[15:24:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6001.drmrs.wmnet
[15:24:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:52] <mutante>	 so 3 downtimes but still not the one that would have prevented the page, afaict
[15:25:43] <mutante>	 apache and phd are being stopped now
[15:27:03] <icinga-wm>	 RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:29:35] <mutante>	 volans: currently we are replacing 'puppet agent' commands in the maintenance script with enable-puppet/disable-puppet/.. heh
[15:29:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6001.drmrs.wmnet
[15:29:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:49] <volans>	 thanks!
[15:33:30] <brennen>	 jynus: we're ready to go ahead here if you want to stop replication
[15:33:45] <jynus>	 ok, logging it, and wait for my ok
[15:33:55] <brennen>	 jynus: ack, thanks
[15:34:11] <jynus>	 !log stopping replication for m3 on db1117, db2078 
[15:34:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:50] <jynus>	 brennen: confirmed replication stopped on selected hosts, you can continue
[15:35:00] <brennen>	 jynus: thanks, going ahead
[15:35:24] <jynus>	 this is the point in time we will quickly rollback in case of the worst
[15:35:33] <thcipriani>	 <3
[15:35:35] <brennen>	 !log starting phabricator deploy, momentary downtime expected while Apache restarts and migrations run
[15:35:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:50] <mutante>	 thanks jynus, it's better this way, we appreciate it
[15:36:01] <RhinosF1>	 Here for moral support, seeing as I filed the upgrade task
[15:38:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to  and previous config saved to /var/cache/conftool/dbconfig/20220615-153820-marostegui.json
[15:38:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:17] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply
[15:39:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:20] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
[15:39:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:24] <edsanders>	 Phabricator down?
[15:39:36] <Lucas_WMDE>	 planned maintenance
[15:39:40] <edsanders>	 thanks
[15:40:22] <mutante>	 !log phabricator upgrade in progress
[15:40:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:00] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6004.drmrs.wmnet
[15:49:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:59] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply
[15:50:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:50:18] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
[15:50:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:23] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply
[15:51:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:27] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply
[15:51:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:32] <Dylsss>	 https://phabricator.wikimedia.org/source/mediawiki/
[15:51:46] <Dylsss>	 Unable to Retrieve Paths
[15:51:46] <Dylsss>	 Command failed with error #1! COMMAND /usr/bin/sudo -E -n -u phd -- git ls-tree -z -l 68972c30d27f1b3a6e268cac0e64a7f78e8d3bb7 -- STDOUT (empty) STDERR sudo: a password is required 
[15:52:37] <hnowlan>	 Probably related to the ongoing maintenance 
[15:52:48] <mutante>	 Dylsss: thanks for reporting this
[15:52:56] <mutante>	 people are looking at it
[15:53:05] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply
[15:53:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:13] <Dylsss>	 👍
[15:53:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P29843 and previous config saved to /var/cache/conftool/dbconfig/20220615-155325-marostegui.json
[15:53:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6004.drmrs.wmnet
[15:53:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:34] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply
[15:53:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:01] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply
[15:55:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:19] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply
[15:55:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:55:54] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply
[15:55:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:22] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply
[15:56:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:08] <jynus>	 mutante: relay the "SECURITY information for phab1001.eqiad.wmnet" ?
[15:59:37] <mutante>	 jynus: where do you see that?
[15:59:43] <jynus>	 root mail
[15:59:55] <jinxer-wm>	 (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[16:01:03] <mutante>	 jynus: ah, ACK. thanks. I know why it happened. it's literally users trying to debug the error reported above
[16:01:15] <jynus>	 ah, ok
[16:01:15] <mutante>	 sudo commands being tested
[16:01:32] <jynus>	 did report it in case it needed ops attention
[16:01:52] <mutante>	 yes, thanks!:)
[16:02:39] <robh>	 hey, the drop down to change task spaces seems gone now?
[16:02:49] <robh>	 I don't have to do it often, but I happen to have one now and cannot edit and change space.
[16:04:55] <jinxer-wm>	 (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[16:05:14] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache1001.eqiad.wmnet with OS buster
[16:05:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:19] <jinxer-wm>	 (ProbeDown) firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:05:58] <robh>	 ok
[16:06:09] <robh>	 mutante: heyas, so I lost the ability to see task spaces or view rights on tasks since the update?
[16:06:19] <robh>	 someone report this already?
[16:06:22] <robh>	 or am i first?
[16:06:44] <jhathaway>	 regarding, thanos-query, looking...
[16:07:03] <mutante>	 robh: you are first. there is still other ongoing stuff being debugged right now
[16:07:10] <jynus>	 is that the metrics frontend?
[16:07:25] <robh>	 Ok, cool.  Yeah so I used to be able to click 'edit task' and see both the space and the view/edit rights and now can see none of those things
[16:07:29] <robh>	 which is problematic heh
[16:07:49] <robh>	 I now seem to be presented with the form for users with no advanced rights heh
[16:08:17] <robh>	 so I can edit the basic task fields like assign, title, status, priority, description, tags, subscribers, and due date. I just cannot see space, edit and view rights
[16:08:28] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[16:08:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:30] <robh>	 just add to the list ; D
[16:08:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T310011)', diff saved to https://phabricator.wikimedia.org/P29844 and previous config saved to /var/cache/conftool/dbconfig/20220615-160830-marostegui.json
[16:08:32] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[16:08:34] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[16:08:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:35] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[16:08:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T310011)', diff saved to https://phabricator.wikimedia.org/P29845 and previous config saved to /var/cache/conftool/dbconfig/20220615-160838-marostegui.json
[16:08:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:55] <jinxer-wm>	 (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[16:10:18] <jinxer-wm>	 (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:11:40] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-presto1006.mgmt.eqiad.wmnet with reboot policy FORCED
[16:11:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:07] <mutante>	 robh: could you maybe create a ticket? there is a lot going on still
[16:12:29] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-presto1007.mgmt.eqiad.wmnet with reboot policy FORCED
[16:12:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:35] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:12:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:05] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-presto1008.mgmt.eqiad.wmnet with reboot policy FORCED
[16:13:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:50] <jynus>	 elastic doesn't seem to be 100% happy still: https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&orgId=1&refresh=5m&from=1655298817054&to=1655309617054
[16:14:55] <jinxer-wm>	 (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[16:15:03] <robh>	 mutante: ok, doing now!
[16:15:19] <robh>	 just phab tag or anything else ya think?
[16:15:22] <thcipriani>	 Dylsss: we think we got that problem, thanks for the report
[16:16:00] <mutante>	 robh: thank you! yea, just phab tag is good enough. people have already started looking
[16:17:36] <robh>	 cool, done and done, and yeah it's not like a ubn
[16:17:43] <robh>	 I'm sure there are ubn tasks pending ;D
[16:20:41] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[16:21:34] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.dhcp for host backup1009.eqiad.wmnet
[16:21:35] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host backup1009.eqiad.wmnet
[16:21:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[16:23:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[16:24:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[16:24:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[16:25:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:11] <logmsgbot>	 !log krinkle@deploy1002 Synchronized multiversion/: Id8cdb8aef70f6672 (duration: 03m 41s)
[16:27:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:31] <mutante>	 Dylsss: the issue you reported should be gone
[16:27:42] <mutante>	 robh: your issue will still be debugged
[16:27:44] <Dylsss>	 Yep, it is gone
[16:27:48] <mutante>	 great
[16:29:13] <icinga-wm>	 PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:29:41] <robh>	 mutante: no doubt, I just didn't want to make it sound urgent because I talked in irc first, is all
[16:29:45] <brennen>	 !log phabricator upgrade finished
[16:29:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:01] <brennen>	 jynus: i think we are ready to re-enable replication if you're still around
[16:30:14] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-presto1006.mgmt.eqiad.wmnet with reboot policy FORCED
[16:30:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:17] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-presto1008.mgmt.eqiad.wmnet with reboot policy FORCED
[16:30:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:20] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-presto1007.mgmt.eqiad.wmnet with reboot policy FORCED
[16:30:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:37] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-presto1009.mgmt.eqiad.wmnet with reboot policy FORCED
[16:30:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:48] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-presto1010.mgmt.eqiad.wmnet with reboot policy FORCED
[16:30:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:05] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-presto1011.mgmt.eqiad.wmnet with reboot policy FORCED
[16:31:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:13] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.130.9:9042 on ml-cache1001 is CRITICAL: connect to address 10.64.130.9 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[16:32:17] <icinga-wm>	 PROBLEM - cassandra-a service on ml-cache1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:34:53] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:37:30] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-presto1012.mgmt.eqiad.wmnet with reboot policy FORCED
[16:37:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:24] <jynus>	 brennen: I just saw the ping
[16:40:45] <jynus>	 phab looking good?
[16:41:17] <brennen>	 jynus: yep, all good.
[16:42:08] <jynus>	 I will start the eqiad one and leave the codfw one stopped
[16:42:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T310011)', diff saved to https://phabricator.wikimedia.org/P29847 and previous config saved to /var/cache/conftool/dbconfig/20220615-164222-marostegui.json
[16:42:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:28] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[16:42:30] <jynus>	 to still have a rollback point for a "we didn't realize some critical bug" case or something
[16:42:38] <jynus>	 but to re-enable eqiad redundancy
[16:42:39] <brennen>	 perfect
[16:44:05] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:44:08] <jynus>	 !log restarting replication for m3 on db1117, not db2078
[16:44:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:48:55] <jinxer-wm>	 (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[16:49:56] <brennen>	 jouncebot nowandnext
[16:49:56] <jouncebot>	 For the next 0 hour(s) and 10 minute(s): Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220615T1500)
[16:49:56] <jouncebot>	 In 1 hour(s) and 10 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220615T1800)
[16:49:56] <jouncebot>	 In 1 hour(s) and 10 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220615T1800)
[16:53:55] <jinxer-wm>	 (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures
[16:54:48] <brennen>	 !log train 1.39.0-wmf.16 (T308069): no current blockers - rolling to group0
[16:54:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:53] <stashbot>	 T308069: 1.39.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T308069
[16:57:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P29848 and previous config saved to /var/cache/conftool/dbconfig/20220615-165727-marostegui.json
[16:57:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:43] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:03:24] <hashar>	 taavi: legoktm: Reedy: looks like Wikibugs died somehow :-\  I have no idea how to restart it though but there is some doc at https://www.mediawiki.org/wiki/Wikibugs
[17:03:29] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.16  refs T308069
[17:03:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:03:34] <stashbot>	 T308069: 1.39.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T308069
[17:05:41] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[17:06:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[17:06:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[17:07:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[17:07:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[17:07:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:00] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-presto1011.mgmt.eqiad.wmnet with reboot policy FORCED
[17:10:03] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:10:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:15] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-presto1009.mgmt.eqiad.wmnet with reboot policy FORCED
[17:10:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:30] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-presto1010.mgmt.eqiad.wmnet with reboot policy FORCED
[17:10:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:45] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-presto1012.mgmt.eqiad.wmnet with reboot policy FORCED
[17:10:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:42] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-presto1013.mgmt.eqiad.wmnet with reboot policy FORCED
[17:11:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:05] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-presto1014.mgmt.eqiad.wmnet with reboot policy FORCED
[17:12:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P29849 and previous config saved to /var/cache/conftool/dbconfig/20220615-171233-marostegui.json
[17:12:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:30] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-presto1015.mgmt.eqiad.wmnet with reboot policy FORCED
[17:14:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:19:51] <brennen>	 things seem stable at group0, taking a break before regular train window.
[17:27:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T310011)', diff saved to https://phabricator.wikimedia.org/P29851 and previous config saved to /var/cache/conftool/dbconfig/20220615-172738-marostegui.json
[17:27:40] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1133.eqiad.wmnet with reason: Maintenance
[17:27:41] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1133.eqiad.wmnet with reason: Maintenance
[17:27:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:44] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[17:27:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[17:33:08] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:33:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:40] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-presto1015.mgmt.eqiad.wmnet with reboot policy FORCED
[17:36:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:44] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-presto1014.mgmt.eqiad.wmnet with reboot policy FORCED
[17:36:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:46] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-presto1013.mgmt.eqiad.wmnet with reboot policy FORCED
[17:36:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:39:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[17:39:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[17:39:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:39:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:39:48] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host stat1010.mgmt.eqiad.wmnet with reboot policy FORCED
[17:39:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:01] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:41:33] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1014.eqiad.wmnet with OS buster
[17:41:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:46:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[17:46:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:45] <icinga-wm>	 RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 1 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[17:52:13] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1014.eqiad.wmnet with OS buster
[17:52:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:48] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2103.codfw.wmnet with reason: Maintenance
[17:54:49] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2103.codfw.wmnet with reason: Maintenance
[17:54:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:50] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 14 hosts with reason: Maintenance
[17:54:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:00] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 14 hosts with reason: Maintenance
[17:55:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:55:31] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1015.eqiad.wmnet with OS buster
[17:55:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:30] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host stat1010.mgmt.eqiad.wmnet with reboot policy FORCED
[17:58:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:46] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1015.eqiad.wmnet with OS buster
[17:58:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:05] <jouncebot>	 brennen and jeena: #bothumor I � Unicode. All rise for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220615T1800).
[18:00:05] <jouncebot>	 brennen and jeena: Your horoscope predicts another unfortunate MediaWiki train - Utc-7 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220615T1800).
[18:00:41] <brennen>	 o/
[18:04:55] <jeena>	 o/
[18:06:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:06:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:06:58] <logmsgbot>	 !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.16  refs T308069
[18:07:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:05] <stashbot>	 T308069: 1.39.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T308069
[18:07:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:07:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:07:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:08:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:09:41] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:10:42] <logmsgbot>	 !log brennen@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.16  refs T308069 (duration: 03m 43s)
[18:10:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:11:15] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:12:10] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[18:13:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:13:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:27] <jynus>	 brennen: something weird with db since 15:55
[18:14:49] <jynus>	 I think it is only codfw
[18:15:04] <jynus>	 maybe there is maintenance
[18:15:58] <jynus>	 yeah, I think there is s1 maintenance on codfw, probably ignorable
[18:16:31] <jynus>	 sorry, I thought it was deployment-related
[18:17:30] <jeena>	 Thanks for checking on it jynus
[18:18:14] <jynus>	 I checked after the kafka thing, but it could be something else (there are not a lot of logs created)
[18:19:11] <brennen>	 jynus: ack, thx
[18:19:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:19:55] <jinxer-wm>	 (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[18:19:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:19:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:30] <jynus>	 ^this increase is 300% so something else must be going on
[18:20:44] <jynus>	 (it is not the db thing)
[18:21:16] <brennen>	 hmm - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&viewPanel=2&refresh=5m doesn't seem to correlate with deploys particularly
[18:21:33] <jynus>	 aqs_cassandra, I think?
[18:21:34] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[18:21:35] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[18:21:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T310011)', diff saved to https://phabricator.wikimedia.org/P29853 and previous config saved to /var/cache/conftool/dbconfig/20220615-182140-marostegui.json
[18:21:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:45] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[18:21:53] <jynus>	 should we notify data engineering?
[18:23:10] <mainframe98>	 awight: post train checkup report: backport worked. Thanks again!
[18:23:31] <jynus>	 https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&refresh=5m&var-datasource=eqiad+prometheus%2Fops&var-input=kafka%2Fclienterror-eqiad&viewPanel=39&from=1655306604837&to=1655317404837
[18:24:41] <jynus>	 even if it is service specific, I am worried it could impact logging for other services
[18:24:55] <jinxer-wm>	 (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike
[18:26:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:26:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:10] <brennen>	 re: notifying data engineering - yes?  i'm over my head here.
[18:28:28] <jynus>	 aqs is them right?
[18:28:35] <jynus>	 I am not 100% sure
[18:29:32] <brennen>	 based on https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS i think so?
[18:29:46] <jynus>	 does anyone know someone that would be up right now?
[18:30:00] <jynus>	 e.g. in americas timezone?
[18:31:02] <brennen>	 milimetric and ottomata appear to be in US tz
[18:31:31] <jynus>	 I guess that is ping enough :-)
[18:31:35] <ottomata>	 hello
[18:31:47] <icinga-wm>	 RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:32:27] <ottomata>	 reading backscroll but still not sure what is up?
[18:32:31] <jynus>	 ottomata: https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&refresh=5m AQS is creating 300% more logs than all other infra since a few minutes ago
[18:32:40] <moritzm>	 brennen: the AQS cluster in codfw is still being set up
[18:33:00] <ottomata>	 cc btullis 
[18:33:08] <moritzm>	 https://phabricator.wikimedia.org/T309808
[18:33:59] <jynus>	 https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&refresh=5m&viewPanel=39&from=1655307212993&to=1655318012993 looks scary, may be delaying the global logging infra
[18:35:09] <btullis>	 Argh, I'm away from keyboard at the moment. I think that this is related to work that urandom has been doing on the new aqs cluster in codfw.
[18:35:16] <jynus>	 (maybe not, but looks worrying anyway)
[18:35:45] <jynus>	 if someone can check that other production logs are not impacted, it can wait
[18:36:01] <jynus>	 (and no production aqs impact)
[18:36:31] <urandom>	 the aqs service was deployed along with Cassandra in codfw, and for some reason it's not happy
[18:37:00] <btullis>	 lmata and cwhite notified us about the logspam a while ago, but then it subsided. Looks like it's back to noisy again.
[18:37:26] <urandom>	 the log messages are its attempt to tell us why, but I don't know what "connection" means (that's the entirety of the message)
[18:37:57] <urandom>	 I just restarted it on one node, maybe that caused a spike?
[18:38:28] <btullis>	 Can we just stop rsyslog on aqsw* to stop the flow of messages into Logstash?
[18:38:45] <jynus>	 it started at 18:05
[18:39:01] <btullis>	 Sorry, typing on phone. That was supposed to be aqs2*
[18:40:56] <cwhite>	 I rolled out some filters about 30min ago to drop the overly verbose logs from Kafka.  Will take a while to burn through the backlog.
[18:42:10] <jynus>	 I may have worried more than needed- I just checked and logs for other services seem to be fresh, so no impact there, AFAICS
[18:42:28] <btullis>	 Maybe the aqs service isn't happy if it's trying to contact druid in eqiad. Just a thought.
[18:42:37] <hashar>	 !log wikibugs (irc bot for Phabricator/Gerrit) is no more working and would need a restart T310734
[18:42:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:43] <stashbot>	 T310734: Wikibugs no more sends Gerrit/Phabricator announcements to IRC 2022-06-15 - https://phabricator.wikimedia.org/T310734
[18:43:44] <btullis>	 Thanks for the updates all.
[18:44:37] <urandom>	 btullis: no idea, these error messages are of no help
[18:57:02] <wikibugs>	 10SRE, 10SRE-OnFire, 10conftool, 10Sustainability (Incident Followup): Invalid confctl selector should either error out or select nothing - https://phabricator.wikimedia.org/T308100 (10Krinkle)
[19:00:20] <wikibugs>	 10SRE, 10SRE-OnFire, 10conftool, 10Sustainability (Incident Followup): Invalid confctl selector should either error out or select nothing - https://phabricator.wikimedia.org/T308100 (10Krinkle)
[19:01:12] <wikibugs>	 10SRE, 10DNS, 10WMF-Legal, 10serviceops, 10wikimediafoundation.org: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (10Varnent) As the #WMF-Legal project tag was added to this task, some general information to avoid wrong expecta...
[19:01:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T310011)', diff saved to https://phabricator.wikimedia.org/P29854 and previous config saved to /var/cache/conftool/dbconfig/20220615-190140-marostegui.json
[19:01:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:45] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[19:06:52] <wikibugs>	 (03PS1) 10BCornwall: Traffic: Port IPsec/Strongswan connection alert [alerts] - 10https://gerrit.wikimedia.org/r/805887 (https://phabricator.wikimedia.org/T291946)
[19:07:47] <wikibugs>	 (03PS2) 10BCornwall: Traffic: Port IPsec/Strongswan connection alert [alerts] - 10https://gerrit.wikimedia.org/r/805887 (https://phabricator.wikimedia.org/T300723)
[19:09:48] <wikibugs>	 (03PS1) 10Ayounsi: Netbox: expose Netbox on the frontend's FQDN [puppet] - 10https://gerrit.wikimedia.org/r/805888 (https://phabricator.wikimedia.org/T243928)
[19:09:50] <wikibugs>	 (03PS1) 10Ayounsi: Prometheus: scrap Netbox django metrics [puppet] - 10https://gerrit.wikimedia.org/r/805889 (https://phabricator.wikimedia.org/T243928)
[19:12:07] <wikibugs>	 (03CR) 10Ayounsi: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/35879/" [puppet] - 10https://gerrit.wikimedia.org/r/805888 (https://phabricator.wikimedia.org/T243928) (owner: 10Ayounsi)
[19:12:18] <wikibugs>	 10SRE, 10SRE-OnFire, 10conftool, 10Sustainability (Incident Followup): Invalid confctl selector should either error out or select nothing - https://phabricator.wikimedia.org/T308100 (10Krinkle)
[19:13:15] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 262 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:16:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P29855 and previous config saved to /var/cache/conftool/dbconfig/20220615-191645-marostegui.json
[19:16:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:57] <wikibugs>	 (03CR) 10Ayounsi: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1001/35880/prometheus1005.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/805889 (https://phabricator.wikimedia.org/T243928) (owner: 10Ayounsi)
[19:19:41] <wikibugs>	 (03PS4) 10Ssingh: bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574)
[19:20:42] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35881/console" [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh)
[19:20:49] <wikibugs>	 10SRE, 10DNS, 10WMF-Legal, 10serviceops, 10wikimediafoundation.org: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (10Dzahn) just a note for serviceops: policy.wikimedia.org is not currently under the control of SRE/prod servers...
[19:23:13] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10netbox, 10Patch-For-Review: Complete Netbox prometheus scraping - https://phabricator.wikimedia.org/T243928 (10ayounsi) a:03ayounsi
[19:23:35] <wikibugs>	 10SRE, 10Privacy Engineering, 10WMF-Legal, 10Privacy: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104 (10Dzahn) Looks like T310738 would make this obsolete.
[19:28:54] <wikibugs>	 10SRE, 10DNS, 10WMF-Legal, 10serviceops, 10wikimediafoundation.org: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (10Dzahn) There are incoming redirects into policy.wikimedia.org:  https://wikimedia.org/stopsurveillance -> http...
[19:31:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q4: rack/setup/install dse-k8s-worker100[5-8] - https://phabricator.wikimedia.org/T307400 (10Jclark-ctr) dse-k8s-worker1005   e1  U33     port 33   Cableid 20220052  dse-k8s-worker1006   e3  U33     port  33  Cableid  20220060    dse-k8s-worker1007   f1   U33     port 33   Cable...
[19:31:06] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q4: rack/setup/install dse-k8s-worker100[5-8] - https://phabricator.wikimedia.org/T307400 (10Jclark-ctr)
[19:31:09] <hashar>	 !log wikibugs IRC bot has been restarted by valhallasw \o/ # T310734
[19:31:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:31:14] <stashbot>	 T310734: Wikibugs no more sends Gerrit/Phabricator announcements to IRC 2022-06-15 - https://phabricator.wikimedia.org/T310734
[19:31:45] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q4: rack/setup/install dse-k8s-worker100[5-8] - https://phabricator.wikimedia.org/T307400 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson
[19:31:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P29856 and previous config saved to /var/cache/conftool/dbconfig/20220615-193150-marostegui.json
[19:31:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:36:18] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC for dnsbox and centrallog hosts: https://puppet-compiler.wmflabs.org/pcc-worker1003/35882/" [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh)
[19:37:14] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC looks OK for existing bird hosts but this change is not ready for review yet. DO NOT MERGE WIP!" [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh)
[19:40:15] <icinga-wm>	 PROBLEM - SSH on ms-be2041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:43:25] <wikibugs>	 (03PS1) 10Ayounsi: wmf-netbox: don't crash with "provider network" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/805898 (https://phabricator.wikimedia.org/T310591)
[19:46:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T310011)', diff saved to https://phabricator.wikimedia.org/P29857 and previous config saved to /var/cache/conftool/dbconfig/20220615-194655-marostegui.json
[19:46:57] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[19:46:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:46:58] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[19:47:00] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[19:47:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T310011)', diff saved to https://phabricator.wikimedia.org/P29858 and previous config saved to /var/cache/conftool/dbconfig/20220615-194703-marostegui.json
[19:47:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:50:03] <logmsgbot>	 !log hashar@deploy1002 Started deploy [integration/docroot@b95391b]: Add Developer Portal - T302809
[19:50:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:50:08] <stashbot>	 T302809: Add dev portal to list of microsites on doc.wikimedia.org - https://phabricator.wikimedia.org/T302809
[19:50:14] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [integration/docroot@b95391b]: Add Developer Portal - T302809 (duration: 00m 10s)
[19:50:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:50:23] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:56:07] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] wmf-config: Add audience to gdi-survey on cawiki beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747549 (https://phabricator.wikimedia.org/T297623) (owner: 10Eigyan)
[20:01:01] <RoanKattouw>	 jouncebot: next
[20:01:01] <jouncebot>	 In 9 hour(s) and 58 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220616T0600)
[20:01:03] <wikibugs>	 (03PS5) 10Ssingh: bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574)
[20:01:28] <RoanKattouw>	 jouncebot: now
[20:01:28] <jouncebot>	 For the next 0 hour(s) and 58 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220615T2000)
[20:01:36] <RoanKattouw>	 Huh not sure why the bot didn't announce it
[20:01:45] <wikibugs>	 (03PS2) 10Catrope: Remove unused setting wgQuickSurveysUseVue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804014 (https://phabricator.wikimedia.org/T285890)
[20:01:55] <wikibugs>	 (03CR) 10Catrope: [C: 03+2] Remove unused setting wgQuickSurveysUseVue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804014 (https://phabricator.wikimedia.org/T285890) (owner: 10Catrope)
[20:02:13] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35883/console" [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh)
[20:02:52] <wikibugs>	 (03Merged) 10jenkins-bot: Remove unused setting wgQuickSurveysUseVue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/804014 (https://phabricator.wikimedia.org/T285890) (owner: 10Catrope)
[20:07:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:07:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:08:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:08:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:27] <logmsgbot>	 !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:804014|Remove unused setting wgQuickSurveysUseVue (T285890)]] (duration: 03m 38s)
[20:08:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:30] <stashbot>	 T285890: Remove OOUI surveys and default to Vue.js - https://phabricator.wikimedia.org/T285890
[20:09:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:09:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:24] <wikibugs>	 (03CR) 10BCornwall: "I'd love feedback on whether I should explore averaging the values so that flips between 1 and 2 are not ignored." [alerts] - 10https://gerrit.wikimedia.org/r/805887 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall)
[20:20:41] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[20:22:02] <wikibugs>	 (03PS1) 10Dzahn: cloud/devtools: fix hiera data for renamed gitlab-runner instance [puppet] - 10https://gerrit.wikimedia.org/r/805900
[20:22:35] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] cloud/devtools: fix hiera data for renamed gitlab-runner instance [puppet] - 10https://gerrit.wikimedia.org/r/805900 (owner: 10Dzahn)
[20:25:44] <wikibugs>	 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T309741 (10phaultfinder)
[20:36:01] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined
[20:41:25] <icinga-wm>	 RECOVERY - SSH on ms-be2041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:47:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T310011)', diff saved to https://phabricator.wikimedia.org/P29859 and previous config saved to /var/cache/conftool/dbconfig/20220615-204717-marostegui.json
[20:47:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:22] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[21:02:11] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:02:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P29860 and previous config saved to /var/cache/conftool/dbconfig/20220615-210223-marostegui.json
[21:02:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:05:41] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[21:11:43] <wikibugs>	 (03PS3) 10MewOphaswongse: Structured task: enable free text for "other" rejection reason [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805480 (https://phabricator.wikimedia.org/T304099)
[21:17:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P29861 and previous config saved to /var/cache/conftool/dbconfig/20220615-211728-marostegui.json
[21:17:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:19:36] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/805782 (owner: Jbond)
[21:21:24] <wikibugs>	 (CR) Jbond: [C: +2] redfish: update poll task to deal with older models [software/spicerack] - https://gerrit.wikimedia.org/r/805782 (owner: Jbond)
[21:29:31] <wikibugs>	 (Merged) jenkins-bot: redfish: update poll task to deal with older models [software/spicerack] - https://gerrit.wikimedia.org/r/805782 (owner: Jbond)
[21:32:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T310011)', diff saved to https://phabricator.wikimedia.org/P29862 and previous config saved to /var/cache/conftool/dbconfig/20220615-213233-marostegui.json
[21:32:35] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[21:32:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:32:36] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[21:32:37] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[21:32:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:32:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:32:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T310011)', diff saved to https://phabricator.wikimedia.org/P29863 and previous config saved to /var/cache/conftool/dbconfig/20220615-213241-marostegui.json
[21:32:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:06] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (Cmjohnson)
[21:46:48] <wikibugs>	 SRE, ops-eqiad, DBA: db1173 won't boot up - https://phabricator.wikimedia.org/T310595 (wiki_willy) a:wiki_willy→Cmjohnson
[21:49:40] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1014.eqiad.wmnet with OS buster
[21:49:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:49:45] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host wdqs1014.eqiad.wmnet with OS buster
[22:02:03] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1014.eqiad.wmnet with reason: host reimage
[22:02:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:07] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1015.eqiad.wmnet with OS buster
[22:02:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:03:07] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1016.eqiad.wmnet with OS buster
[22:03:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:03:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T310011)', diff saved to https://phabricator.wikimedia.org/P29864 and previous config saved to /var/cache/conftool/dbconfig/20220615-220329-marostegui.json
[22:03:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:03:34] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[22:04:21] <icinga-wm>	 PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:05:13] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1014.eqiad.wmnet with reason: host reimage
[22:05:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:12:10] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[22:12:56] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1016.eqiad.wmnet with OS buster
[22:12:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:13:00] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host wdqs1016.eqiad.wmnet with OS buster executed w...
[22:14:32] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1015.eqiad.wmnet with reason: host reimage
[22:14:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:16:44] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1016.eqiad.wmnet with OS buster
[22:16:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:16:49] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host aqs1016.eqiad.wmnet with OS buster
[22:17:00] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1016.eqiad.wmnet with OS buster
[22:17:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:05] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host aqs1016.eqiad.wmnet with OS buster executed with errors: - aqs1016...
[22:17:37] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1015.eqiad.wmnet with reason: host reimage
[22:17:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:56] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1014.eqiad.wmnet with OS buster
[22:17:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:18:01] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host wdqs1014.eqiad.wmnet with OS buster completed:...
[22:18:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P29865 and previous config saved to /var/cache/conftool/dbconfig/20220615-221834-marostegui.json
[22:18:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:31:22] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1015.eqiad.wmnet with OS buster
[22:31:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:31:28] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host wdqs1015.eqiad.wmnet with OS buster completed:...
[22:33:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P29866 and previous config saved to /var/cache/conftool/dbconfig/20220615-223339-marostegui.json
[22:33:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:35:00] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (Cmjohnson) 1014 and 1015 are installed, 1016 shows that no cables are connected. John will look at that in the morning.
[22:46:12] <icinga-wm>	 RECOVERY - Check systemd state on an-tool1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:48:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T310011)', diff saved to https://phabricator.wikimedia.org/P29867 and previous config saved to /var/cache/conftool/dbconfig/20220615-224845-marostegui.json
[22:48:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:48:50] <stashbot>	 T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[23:39:11] <wikibugs>	 (PS1) Cwhite: logstash: add test2 partition to ecs-test policy [puppet] - https://gerrit.wikimedia.org/r/805921 (https://phabricator.wikimedia.org/T301760)