[00:13:37] PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2022-06-07 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:19:05] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:20:41] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [00:25:45] PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2022-06-07 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:25:45] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:36:03] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:41:40] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:42:59] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:53:05] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:00:01] 
RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:04:38] (PS1) Krinkle: MessageCache: Increase the MapCacheLRU size [core] (wmf/1.39.0-wmf.16) - https://gerrit.wikimedia.org/r/805435 (https://phabricator.wikimedia.org/T310532) [01:04:56] (PS1) Krinkle: MessageCache: Increase the MapCacheLRU size [core] (wmf/1.39.0-wmf.15) - https://gerrit.wikimedia.org/r/805436 (https://phabricator.wikimedia.org/T310532) [01:05:41] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [01:06:57] PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:06:59] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:31:07] RECOVERY - Check systemd state on maps2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:34:03] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:38:05] PROBLEM - Check systemd state on maps2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
An error occured trying to list the failed units https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:40:59] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:44:05] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:47:35] RECOVERY - dump of es4 in eqiad on backupmon1001 is OK: Last dump for es4 at eqiad (es1022) taken on 2022-06-14 00:00:01 (3153 GiB, +0.9 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:52:22] (CR) Tim Starling: [C: +2] MessageCache: Increase the MapCacheLRU size [core] (wmf/1.39.0-wmf.15) - https://gerrit.wikimedia.org/r/805436 (https://phabricator.wikimedia.org/T310532) (owner: Krinkle) [01:52:28] (CR) Tim Starling: [C: +2] MessageCache: Increase the MapCacheLRU size [core] (wmf/1.39.0-wmf.16) - https://gerrit.wikimedia.org/r/805435 (https://phabricator.wikimedia.org/T310532) (owner: Krinkle) [01:57:59] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:58:03] RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:58:41] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:59:37] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:59:39] RECOVERY - dump of es5 in codfw on backupmon1001 is OK: Last dump for es5 at codfw (es2025) 
taken on 2022-06-14 00:00:02 (3132 GiB, +0.9 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [02:00:51] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.300 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:01:45] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48249 bytes in 0.124 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:09:17] (Merged) jenkins-bot: MessageCache: Increase the MapCacheLRU size [core] (wmf/1.39.0-wmf.15) - https://gerrit.wikimedia.org/r/805436 (https://phabricator.wikimedia.org/T310532) (owner: Krinkle) [02:09:49] (Merged) jenkins-bot: MessageCache: Increase the MapCacheLRU size [core] (wmf/1.39.0-wmf.16) - https://gerrit.wikimedia.org/r/805435 (https://phabricator.wikimedia.org/T310532) (owner: Krinkle) [02:11:39] PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:12:10] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [02:12:10] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [02:13:03] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:17:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mwdebug: apply [02:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:17:24] !log tstarling@deploy1002 Synchronized php-1.39.0-wmf.15/includes/cache/MessageCache.php: T310532 (duration: 03m 29s) [02:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:17:29] T310532: Investigate McRouter GET request spike from wmf.15 - https://phabricator.wikimedia.org/T310532 [02:19:45] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:21:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:21:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:24:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:25:37] !log tstarling@deploy1002 Synchronized php-1.39.0-wmf.16/includes/cache/MessageCache.php: (no justification provided) (duration: 03m 36s) [02:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:29:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:30:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:30:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:30:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:31:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:45:03] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:48:05] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:52:00] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:55:57] PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:01:55] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:07:37] RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:14:03] RECOVERY - Check systemd state on maps1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:16:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [03:18:13] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [03:19:35] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:20:59] PROBLEM - Check systemd state on maps1005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:29:05] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:32:03] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:38:59] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:42:59] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:45:05] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [03:45:39] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:46:01] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [04:04:01] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:10:45] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:11:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [04:17:15] RECOVERY - Check systemd state on maps1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:20:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [04:20:41] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch 
instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [04:22:01] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:24:03] PROBLEM - Check systemd state on maps1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. An error occured trying to list the failed units https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:25:51] PROBLEM - Check systemd state on centrallog1001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:28:45] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:31:01] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:35:01] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:37:53] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:39:07] RECOVERY - Check systemd state on maps1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:41:57] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units 
failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:45:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [04:51:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [04:53:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1173.eqiad.wmnet with OS bullseye [04:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:59:47] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:00:01] PROBLEM - Check systemd state on maps1010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:01:05] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:03:17] !log Reboot dbproxy1016 and dbproxy1021 T310484 [05:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:03:21] T310484: Reboot dbproxy for kernel upgrades - https://phabricator.wikimedia.org/T310484 [05:04:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime 
for 2:00:00 on db1173.eqiad.wmnet with reason: host reimage [05:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:41] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [05:06:30] SRE, DBA, Security: Reboot dbproxy for kernel upgrades - https://phabricator.wikimedia.org/T310484 (Marostegui) [05:07:05] SRE, DBA, Security: Reboot dbproxy for kernel upgrades - https://phabricator.wikimedia.org/T310484 (Marostegui) Open→Resolved All done ` ===== NODE GROUP ===== (12) dbproxy[2001-2004].codfw.wmnet,dbproxy[1012-1017,1020-1021].eqiad.wmnet ----- OUTPUT of 'sudo uname -v' ----- #1 SMP Debian 5.10... [05:07:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1173.eqiad.wmnet with reason: host reimage [05:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:11:36] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:11:42] (PS1) Marostegui: es2*: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/805503 (https://phabricator.wikimedia.org/T310485) [05:12:06] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:12:47] (CR) Marostegui: [C: +2] es2*: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/805503 (https://phabricator.wikimedia.org/T310485) (owner: Marostegui) [05:14:06] (PS1) Marostegui: pc1011: Disable notifications 
[puppet] - https://gerrit.wikimedia.org/r/805504 (https://phabricator.wikimedia.org/T310485) [05:14:45] (JobUnavailable) firing: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:15:31] (CR) Marostegui: [C: +2] pc1011: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/805504 (https://phabricator.wikimedia.org/T310485) (owner: Marostegui) [05:16:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [05:17:10] !log dbmaint es1@codfw T310485 [05:17:12] !log dbmaint es2@codfw T310485 [05:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:17:14] !log dbmaint es3@codfw T310485 [05:17:15] !log dbmaint es4@codfw T310485 [05:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:17:17] !log dbmaint es5@codfw T310485 [05:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:18:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [05:18:56] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:19:45] (JobUnavailable) resolved: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:21:36] SRE, ops-eqiad, DBA: db1173 won't boot up - https://phabricator.wikimedia.org/T310595 (Marostegui) p:High→Medium Thank you so much @Cmjohnson! I can indeed access the host now and I have reimaged it successfully. Decreasing the priority since the initial issue was triaged (so fast!). So once t... [05:23:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1173.eqiad.wmnet with OS bullseye [05:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1139.eqiad.wmnet with reason: Maintenance [05:34:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1139.eqiad.wmnet with reason: Maintenance [05:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1098.eqiad.wmnet with reason: Maintenance [05:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1098.eqiad.wmnet with reason: Maintenance [05:42:51] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T302659)', diff saved to https://phabricator.wikimedia.org/P29745 and previous config saved to /var/cache/conftool/dbconfig/20220615-054252-marostegui.json [05:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:57] T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659 [05:49:10] RECOVERY - Check systemd state on maps1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:50:12] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:53:36] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:59:02] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:00:30] PROBLEM - MariaDB Replica IO: s8 on db2084 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2079.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2079.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:00:35] ^ me [06:00:52] PROBLEM - MariaDB Replica IO: s8 on db2080 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2079.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2079.codfw.wmnet (111 
Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:00:56] PROBLEM - MariaDB Replica IO: s8 on db2152 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2079.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2079.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:01:44] RECOVERY - MariaDB Replica IO: s8 on db2084 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:01:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1105.eqiad.wmnet with reason: Maintenance [06:01:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1105.eqiad.wmnet with reason: Maintenance [06:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T310011)', diff saved to https://phabricator.wikimedia.org/P29746 and previous config saved to /var/cache/conftool/dbconfig/20220615-060153-marostegui.json [06:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:57] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [06:02:12] RECOVERY - MariaDB Replica IO: s8 on db2152 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:02:35] !log Reboot db[2071-2078] T310485 [06:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:46] PROBLEM - Check systemd state on maps2008 
is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:03:32] RECOVERY - MariaDB Replica IO: s8 on db2080 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:04:12] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:06:04] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:09:09] (PS1) Marostegui: Revert "pc1011: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/805437 [06:10:06] (CR) Marostegui: [C: +2] Revert "pc1011: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/805437 (owner: Marostegui) [06:10:32] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:11:00] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:12:10] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [06:12:10] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [06:12:16] 
PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:12:27] (PS1) Marostegui: Revert "es2*: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/805438 [06:13:30] (CR) Marostegui: [C: +2] Revert "es2*: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/805438 (owner: Marostegui) [06:14:02] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:14:06] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:22:43] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:23:05] RECOVERY - Check systemd state on maps1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:23:55] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:24:09] PROBLEM - Check systemd state on maps1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
An error occured trying to list the failed units https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:26:23] PROBLEM - Check systemd state on maps1005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:27:03] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:51] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:34:11] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:38:05] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:38:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T310011)', diff saved to https://phabricator.wikimedia.org/P29747 and previous config saved to /var/cache/conftool/dbconfig/20220615-063837-marostegui.json [06:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:44] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [06:42:05] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:52:46] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:52:58] !log disable BGP to Telia in eqsin for optic replacement - T300485 [06:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:03] T300485: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 [06:53:28] a prometheus job will complain temporarily while I reboot the bacula director [06:53:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P29748 and previous config saved to /var/cache/conftool/dbconfig/20220615-065342-marostegui.json [06:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:57] bacula metrics should be back up [06:57:08] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:00:05] Amir1 and Urbanecm: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220615T0700). Please do the needful. [07:00:05] No Gerrit patches in the queue for this window AFAICS. 
[07:05:28] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:07:06] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:08:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P29749 and previous config saved to /var/cache/conftool/dbconfig/20220615-070847-marostegui.json [07:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:58] PROBLEM - Check systemd state on webperf2002 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_compress_logs.timer,excimer-k8s-log.service,excimer-k8s-wall-log.service,excimer-wall-log.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:11:12] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:12:24] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:16:10] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:17:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P29750 and previous config saved to /var/cache/conftool/dbconfig/20220615-071728-root.json [07:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1148 (re)pooling 
@ 5%: After schema change', diff saved to https://phabricator.wikimedia.org/P29751 and previous config saved to /var/cache/conftool/dbconfig/20220615-072034-root.json [07:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T310011)', diff saved to https://phabricator.wikimedia.org/P29752 and previous config saved to /var/cache/conftool/dbconfig/20220615-072352-marostegui.json [07:23:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [07:23:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [07:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:58] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [07:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:32] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:31:25] (03PS3) 10Slyngshede: Update email address for goransm. 
[puppet] - 10https://gerrit.wikimedia.org/r/805389 (https://phabricator.wikimedia.org/T310055) [07:32:02] RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:32:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P29753 and previous config saved to /var/cache/conftool/dbconfig/20220615-073232-root.json [07:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:34:30] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:35:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1148 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P29754 and previous config saved to /var/cache/conftool/dbconfig/20220615-073538-root.json [07:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:32] PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:39:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:39:11] (03PS2) 10KartikMistry: testwiki: Enable SectionTranslation for 11 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805370 (https://phabricator.wikimedia.org/T309384) [07:43:12] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:43:19] 10SRE, 10ops-codfw: Degraded RAID on aqs2005 - https://phabricator.wikimedia.org/T310610 (10SLyngshede-WMF) p:05Triage→03Low [07:44:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:44:08] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:44:22] 10SRE, 10ops-codfw: Degraded RAID on aqs2005 - https://phabricator.wikimedia.org/T310610 (10SLyngshede-WMF) p:05Low→03Medium [07:46:10] 10SRE, 10ops-eqiad: cloudstore1008 - eno2 reporting no carrier - https://phabricator.wikimedia.org/T309885 (10SLyngshede-WMF) p:05Triage→03Medium [07:47:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P29755 and previous config saved to /var/cache/conftool/dbconfig/20220615-074736-root.json [07:47:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:02] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:49:13] (03CR) 10Slyngshede: [C: 03+2] Update email address for goransm. [puppet] - 10https://gerrit.wikimedia.org/r/805389 (https://phabricator.wikimedia.org/T310055) (owner: 10Slyngshede) [07:49:36] (03PS1) 10KartikMistry: Update cxserver to 2022-06-15-074244-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/805726 (https://phabricator.wikimedia.org/T309266) [07:50:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1099.eqiad.wmnet with reason: Maintenance [07:50:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1099.eqiad.wmnet with reason: Maintenance [07:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T310011)', diff saved to https://phabricator.wikimedia.org/P29756 and previous config saved to /var/cache/conftool/dbconfig/20220615-075024-marostegui.json [07:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:28] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [07:50:42] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:50:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1148 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P29757 and previous config saved to 
/var/cache/conftool/dbconfig/20220615-075042-root.json [07:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:38] (03PS3) 10Slyngshede: Allow deployers to sudo -u mwpresync [puppet] - 10https://gerrit.wikimedia.org/r/805464 (https://phabricator.wikimedia.org/T310654) (owner: 10Ahmon Dancy) [07:53:06] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/805464 (https://phabricator.wikimedia.org/T310654) (owner: 10Ahmon Dancy) [07:53:48] (03CR) 10Slyngshede: [C: 03+2] Allow deployers to sudo -u mwpresync [puppet] - 10https://gerrit.wikimedia.org/r/805464 (https://phabricator.wikimedia.org/T310654) (owner: 10Ahmon Dancy) [07:54:34] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:54:41] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10Patch-For-Review, 10Release-Engineering-Team (Deployment Autopilot 🛩️): Allow deployers to sudo -u mwpresync - https://phabricator.wikimedia.org/T310654 (10SLyngshede-WMF) 05Open→03Resolved p:05Triage→03High a:03SLyngshede-WMF [07:57:10] RECOVERY - Check systemd state on maps1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:57:41] 10SRE, 10serviceops, 10Patch-For-Review: Update conf1* servers - https://phabricator.wikimedia.org/T310062 (10SLyngshede-WMF) p:05Triage→03Medium [07:58:04] 10SRE, 10MediaWiki-General, 10Traffic: Query canonicalization for MediaWiki - https://phabricator.wikimedia.org/T310087 (10SLyngshede-WMF) p:05Triage→03Medium [07:59:04] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 244 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:00:39] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/805468 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [08:01:28] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:02:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:02:26] (03PS1) 10Muehlenhoff: Retire profile::logster_alarm [puppet] - 10https://gerrit.wikimedia.org/r/805733 [08:02:28] (03PS1) 10Muehlenhoff: Retire profile::logster_alarm [puppet] - 10https://gerrit.wikimedia.org/r/805734 [08:02:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P29758 and previous config saved to /var/cache/conftool/dbconfig/20220615-080240-root.json [08:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:27] !log re-enable BGP to Telia in eqsin for optic replacement - T300485 [08:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:30] T300485: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 [08:03:44] PROBLEM - Check systemd state on maps1010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:05:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1148 (re)pooling 
@ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P29759 and previous config saved to /var/cache/conftool/dbconfig/20220615-080546-root.json [08:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:06] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:07:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:09:01] (03CR) 10Ayounsi: [V: 03+1 C: 03+2] Netbox: only run CSV dumps on active server [puppet] - 10https://gerrit.wikimedia.org/r/805468 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [08:09:17] (03CR) 10Slyngshede: [C: 03+1] "LGTM, unless we want to ensure that the redundant files are removed first." [puppet] - 10https://gerrit.wikimedia.org/r/805734 (owner: 10Muehlenhoff) [08:10:44] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (26) node(s) change every puppet run: an-tool1009, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudservices1003, cloudservices1004, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, netboxdb2001, netboxdb2002, puppetdb2002, thanos-fe1002, thanos-fe1003, thanos-fe2001 [08:10:44] -fe2002, thanos-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [08:12:24] (03CR) 10Slyngshede: [C: 03+1] "Looks good, but I think we can just remove the custom parser as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/805733 (owner: 10Muehlenhoff) [08:12:42] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:14:02] RECOVERY - Check systemd state on dumpsdata1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:17:24] (03PS1) 10Awight: Remove $wgVisualEditorTransclusionDialogBackButton feature flag [extensions/Cite] (wmf/1.39.0-wmf.16) - 10https://gerrit.wikimedia.org/r/805748 (https://phabricator.wikimedia.org/T307188) [08:17:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P29760 and previous config saved to /var/cache/conftool/dbconfig/20220615-081744-root.json [08:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:02] RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:19:00] (03PS1) 10Jaime Nuche: scap: remove config for scap Debian package [puppet] - 10https://gerrit.wikimedia.org/r/805736 (https://phabricator.wikimedia.org/T303559) [08:20:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:20:38] PROBLEM - Check systemd state on dumpsdata1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rasdaemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[08:20:41] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:20:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1148 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P29761 and previous config saved to /var/cache/conftool/dbconfig/20220615-082050-root.json [08:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:04] (03PS12) 10Elukey: Add new Cassandra cluster for ML cache/feature-store workloads in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232) [08:21:18] (03CR) 10Elukey: Add new Cassandra cluster for ML cache/feature-store workloads in eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232) (owner: 10Elukey) [08:22:14] !log jnuche@deploy1002 Installing scap version "4.9.3" for 557 hosts [08:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:34] !log jnuche@deploy1002 Installation of scap version "4.9.3" completed for 557 hosts [08:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:47] !log jnuche@deploy1002 Installing scap version "4.9.3" for 557 hosts [08:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:07] !log jnuche@deploy1002 Installation of scap version "4.9.3" completed for 557 hosts [08:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:38] PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:04] RECOVERY - Check systemd state on maps1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:52] (03CR) 10Muehlenhoff: Retire profile::logster_alarm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/805733 (owner: 10Muehlenhoff) [08:27:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T310011)', diff saved to https://phabricator.wikimedia.org/P29762 and previous config saved to /var/cache/conftool/dbconfig/20220615-082734-marostegui.json [08:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:39] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [08:28:04] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:29:02] RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:30:32] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "I can confirm. 
👍" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805433 (owner: 10DannyS712) [08:32:42] PROBLEM - Check systemd state on maps1005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:34:06] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:34:40] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:35:13] 10SRE, 10serviceops, 10Patch-For-Review: Update conf1* servers - https://phabricator.wikimedia.org/T310062 (10SLyngshede-WMF) p:05Medium→03High [08:35:38] PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:35:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1148 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P29763 and previous config saved to /var/cache/conftool/dbconfig/20220615-083554-root.json [08:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:04] 10SRE, 10Thumbor, 10Traffic: Thumbor URLs are too permissive - https://phabricator.wikimedia.org/T310528 (10SLyngshede-WMF) p:05Triage→03Medium [08:38:14] 10SRE, 10Traffic, 10serviceops: fawiki user reports getting 503 errors with message "upstream connect error or disconnect before headers" - https://phabricator.wikimedia.org/T310450 (10SLyngshede-WMF) p:05Triage→03Medium [08:39:04] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:40:21] (03CR) 10Volans: 
[C: 04-1] "Approach looks good, some minor things to fix inline." [cookbooks] - 10https://gerrit.wikimedia.org/r/792251 (owner: 10Jbond) [08:40:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1098.eqiad.wmnet with reason: Maintenance [08:40:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1098.eqiad.wmnet with reason: Maintenance [08:40:42] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T302659)', diff saved to https://phabricator.wikimedia.org/P29764 and previous config saved to /var/cache/conftool/dbconfig/20220615-084046-marostegui.json [08:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:51] T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659 [08:41:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T302659)', diff saved to https://phabricator.wikimedia.org/P29765 and previous config saved to /var/cache/conftool/dbconfig/20220615-084151-marostegui.json [08:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:12] (03CR) 10Thiemo Kreuz (WMDE): "Personally, I find this one of the more questionable rules we have. 
The benefit is small, especially when I use an IDE that clearly separa" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805432 (https://phabricator.wikimedia.org/T171115) (owner: 10DannyS712) [08:42:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P29766 and previous config saved to /var/cache/conftool/dbconfig/20220615-084239-marostegui.json [08:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:50] (03PS1) 10Ayounsi: Netbox: remove Icinga checks for netbox.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/805740 (https://phabricator.wikimedia.org/T296452) [08:45:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:45:40] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:47:54] (03PS2) 10David Caro: wmcs: move alerting code to a library [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805376 (https://phabricator.wikimedia.org/T309786) [08:47:56] (03PS2) 10David Caro: wmcs.ceph.upgrade*: add sal logs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805377 (https://phabricator.wikimedia.org/T309786) [08:47:58] (03PS1) 10David Caro: wmcs.ceph: move core code to a library [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805741 (https://phabricator.wikimedia.org/T309786) [08:48:03] (03PS1) 10David Caro: wmcs.alert/ceph: allow downtiming alerts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805742 
[08:50:04] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:50:51] 10SRE, 10serviceops: Requesting SSH keypair for deployment server keyholder to push to Gerrit - https://phabricator.wikimedia.org/T310620 (10SLyngshede-WMF) p:05Triage→03Medium [08:50:55] (03Abandoned) 10Thiemo Kreuz (WMDE): Remove $wgVisualEditorTransclusionDialogBackButton feature flag [extensions/Cite] (wmf/1.39.0-wmf.16) - 10https://gerrit.wikimedia.org/r/805748 (https://phabricator.wikimedia.org/T307188) (owner: 10Awight) [08:51:02] RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:51:41] 10SRE, 10serviceops: Requesting SSH keypair for deployment server keyholder to push to Gerrit - https://phabricator.wikimedia.org/T310620 (10SLyngshede-WMF) This doesn't appear to be an SRE-Access-Request. Adding the ServiceOps tags, as they are involved in the Kubernetes migration and it makes sense to loop t... [08:51:51] 10SRE, 10SRE-OnFire, 10Traffic, 10Sustainability (Incident Followup): (Re) evaluate effectiveness / usefulness of varnish/haproxy traffic drop alerts - https://phabricator.wikimedia.org/T310608 (10SLyngshede-WMF) p:05Triage→03Medium [08:53:22] (03CR) 10CI reject: [V: 04-1] wmcs.ceph: move core code to a library [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805741 (https://phabricator.wikimedia.org/T309786) (owner: 10David Caro) [08:53:26] (03CR) 10CI reject: [V: 04-1] wmcs.alert/ceph: allow downtiming alerts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805742 (owner: 10David Caro) [08:56:01] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Check access rights for GoranSMilovanovic - https://phabricator.wikimedia.org/T310055 (10SLyngshede-WMF) 05Open→03Resolved p:05Triage→03Medium Email address is updated, everything else looks fine. 
[08:56:04] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:56:40] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:56:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [08:56:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P29767 and previous config saved to /var/cache/conftool/dbconfig/20220615-085656-marostegui.json [08:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:14] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Gonna -1 this one for now to avoid accidental merging while some clarifications and communications are ongoing regarding this one." 
[puppet] - 10https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537) (owner: 10Brennen Bearnes) [08:57:38] PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:57:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P29768 and previous config saved to /var/cache/conftool/dbconfig/20220615-085744-marostegui.json [08:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:08] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:01:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [09:02:46] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:03:08] PROBLEM - Host ms-be1059 is DOWN: PING CRITICAL - Packet loss = 100% [09:05:19] 10SRE, 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (10ayounsi) We tried: * New optic * New patch cable * New router port And the errors are still present. Next step is to follow up with Telia so they check their side and then the DC for the X-connect. 
[09:05:41] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:06:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [09:07:25] (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [09:07:48] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:08:15] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Xcollazo - https://phabricator.wikimedia.org/T310385 (10SLyngshede-WMF) 05Open→03Resolved [09:09:48] RECOVERY - Host ms-be1059 is UP: PING OK - Packet loss = 0%, RTA = 0.15 ms [09:10:39] (03CR) 10Awight: "Cherry-pick for backport." 
[extensions/VisualEditor] (wmf/1.39.0-wmf.16) - 10https://gerrit.wikimedia.org/r/805745 (https://phabricator.wikimedia.org/T310602) (owner: 10Awight) [09:12:01] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Restore internal mechanism to use either back or close button [extensions/VisualEditor] (wmf/1.39.0-wmf.16) - 10https://gerrit.wikimedia.org/r/805745 (https://phabricator.wikimedia.org/T310602) (owner: 10Awight) [09:12:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P29769 and previous config saved to /var/cache/conftool/dbconfig/20220615-091201-marostegui.json [09:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:10] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [09:12:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T310011)', diff saved to https://phabricator.wikimedia.org/P29770 and previous config saved to /var/cache/conftool/dbconfig/20220615-091249-marostegui.json [09:12:50] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, and 2 others: Spicerack cookbooks TODO list - https://phabricator.wikimedia.org/T203943 (10joanna_borun) [09:12:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1119.eqiad.wmnet with reason: Maintenance [09:12:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1119.eqiad.wmnet with reason: Maintenance [09:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:54] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - 
https://phabricator.wikimedia.org/T310011 [09:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T310011)', diff saved to https://phabricator.wikimedia.org/P29771 and previous config saved to /var/cache/conftool/dbconfig/20220615-091257-marostegui.json [09:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:06] RECOVERY - Check systemd state on maps2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:13:29] (03CR) 10Jaime Nuche: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/805736 (https://phabricator.wikimedia.org/T303559) (owner: 10Jaime Nuche) [09:14:56] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1059.eqiad.wmnet with OS bullseye [09:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:00] (03PS6) 10Klausman: service::catalog: Add inference-staging service [puppet] - 10https://gerrit.wikimedia.org/r/805329 (https://phabricator.wikimedia.org/T302195) [09:15:04] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1059.eqiad.wmnet with OS bullseye [09:15:08] (03PS8) 10Jbond: sre.host.pxe: Cookbook to configure dhcp option82 and reboot into pxe [cookbooks] - 10https://gerrit.wikimedia.org/r/792251 [09:16:55] vgutierrez: ack enjoy :) [09:17:10] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - 
https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [09:17:43] (03PS2) 10David Caro: wmcs.ceph: move core code to a library [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805741 (https://phabricator.wikimedia.org/T309786) [09:17:45] (03PS2) 10David Caro: wmcs.alert/ceph: allow downtiming alerts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805742 [09:17:47] (03CR) 10Jbond: "update thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/792251 (owner: 10Jbond) [09:18:03] (03PS1) 10Mainframe98: Fix deletion of translation pages outside of NS_MAIN namespace [extensions/Translate] (wmf/1.39.0-wmf.16) - 10https://gerrit.wikimedia.org/r/805749 (https://phabricator.wikimedia.org/T310440) [09:19:40] PROBLEM - Check systemd state on maps2005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:20:32] !log Reboot sanitarium hosts (db1154, db1155) wiki replicas will have lag [09:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:10] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [09:22:50] (03CR) 10CI reject: [V: 04-1] wmcs.alert/ceph: allow downtiming alerts [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805742 (owner: 10David Caro) [09:23:38] (03CR) 10CI reject: [V: 04-1] wmcs.ceph: move core code to a library [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/805741 (https://phabricator.wikimedia.org/T309786) (owner: 10David Caro) [09:24:45] (JobUnavailable) firing: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:27:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T302659)', diff saved to https://phabricator.wikimedia.org/P29772 and previous config saved to /var/cache/conftool/dbconfig/20220615-092706-marostegui.json [09:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2121.codfw.wmnet with reason: Maintenance [09:27:12] T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659 [09:27:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2121.codfw.wmnet with reason: Maintenance [09:27:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 10 hosts with reason: Maintenance [09:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 10 hosts with reason: Maintenance [09:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:00] (03CR) 10Volans: [C: 04-1] "In general LGTM, couple of minor comments in line." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm) [09:29:45] (JobUnavailable) resolved: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:32:34] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10MatthewVernon) [09:32:48] 10SRE, 10SRE-swift-storage, 10ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (10MatthewVernon) 05Resolved→03Open Hi, Sorry to reopen this task, but there's a licence issue, I think. When I try and use the HTML5 console, it says: "License Required This iLO is... [09:39:21] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (25) node(s) change every puppet run: aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudservices1003, cloudservices1004, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, netboxdb2001, netboxdb2002, puppetdb2002, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe20 [09:39:21] os-fe2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [09:40:41] (03PS1) 10Kosta Harlan: GrowthExperiments: Enable link recommendation on aswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805766 [09:44:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4001.ulsfo.wmnet [09:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:31] (03PS1) 10Marostegui: change_localuser.lu_attached_timestamp_T302659.py: Replaced the check [software/schema-changes] - 10https://gerrit.wikimedia.org/r/805767 (https://phabricator.wikimedia.org/T302659) [09:44:50] (03PS2) 10Marostegui: 
change_localuser.lu_attached_timestamp_T302659.py: Replace the check [software/schema-changes] - 10https://gerrit.wikimedia.org/r/805767 (https://phabricator.wikimedia.org/T302659) [09:45:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T310011)', diff saved to https://phabricator.wikimedia.org/P29773 and previous config saved to /var/cache/conftool/dbconfig/20220615-094532-marostegui.json [09:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:38] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [09:46:57] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:47:56] (03CR) 10Marostegui: [C: 03+2] change_localuser.lu_attached_timestamp_T302659.py: Replace the check [software/schema-changes] - 10https://gerrit.wikimedia.org/r/805767 (https://phabricator.wikimedia.org/T302659) (owner: 10Marostegui) [09:48:03] RECOVERY - Check systemd state on maps1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:48:52] (03Merged) 10jenkins-bot: change_localuser.lu_attached_timestamp_T302659.py: Replace the check [software/schema-changes] - 10https://gerrit.wikimedia.org/r/805767 (https://phabricator.wikimedia.org/T302659) (owner: 10Marostegui) [09:49:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4001.ulsfo.wmnet [09:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:03] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:52:07] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:52:20] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/805740 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [09:52:41] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:54:25] PROBLEM - Check systemd state on maps1008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:57:37] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:58:43] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:59:02] (03CR) 10Volans: [C: 04-1] "Minor comments inline, one need to be fixed, would not work as is." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [09:59:14] (03CR) 10Volans: [C: 04-1] "clarified comment" [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm) [10:00:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P29774 and previous config saved to /var/cache/conftool/dbconfig/20220615-100037-marostegui.json [10:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:53] hnowlan: are the maps critical for prometheus-pg-replication-lag.service known ? seems to be flip/flopping [10:02:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1101.eqiad.wmnet with reason: Maintenance [10:02:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1101.eqiad.wmnet with reason: Maintenance [10:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T302659)', diff saved to https://phabricator.wikimedia.org/P29775 and previous config saved to /var/cache/conftool/dbconfig/20220615-100235-marostegui.json [10:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:40] T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659 [10:03:44] godog: a check was broken yesterday, there's a fix in review - will try to move it along [10:03:55] (03CR) 10Volans: "One missed thing I think, then LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/792251 (owner: 10Jbond) [10:05:13] hnowlan: ah! 
ack, thanks [10:06:27] (03CR) 10Volans: "clarification comment" [puppet] - 10https://gerrit.wikimedia.org/r/779531 (https://phabricator.wikimedia.org/T305589) (owner: 10Dzahn) [10:07:07] RECOVERY - Check systemd state on maps2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:08:27] (03CR) 10Volans: "I don't mind it, LGTM. I'm just a little bit worries that this will take people by surprise with the additional diff. We should communicat" [cookbooks] - 10https://gerrit.wikimedia.org/r/804575 (owner: 10Jbond) [10:08:49] (03CR) 10Filippo Giunchedi: "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/805740 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [10:12:10] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:13:45] PROBLEM - Check systemd state on maps2005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:15:03] (03CR) 10Filippo Giunchedi: [C: 03+1] "I'm adding Ben re: the analytics/data-engineering question and how these alerts should reach which team, HTH!" 
[alerts] - 10https://gerrit.wikimedia.org/r/805237 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [10:15:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P29776 and previous config saved to /var/cache/conftool/dbconfig/20220615-101543-marostegui.json [10:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:03] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:22:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:22:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:13] (03CR) 10JMeybohm: [C: 04-1] "This would leave failing test jobs around forever (requiring manual cleanup)." [deployment-charts] - 10https://gerrit.wikimedia.org/r/803597 (owner: 10Ahmon Dancy) [10:25:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:40] 10SRE, 10SRE-swift-storage, 10ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (10MatthewVernon) Hi, Sorry for further noise, but: is it possible there's some USB thing still connected to this system? It won't reimage because of: ` [ 12.234765] sd 0:0:0:0: [sda] A... 
[10:27:17] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:28:39] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:29:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1029 es1030 es1028 for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29777 and previous config saved to /var/cache/conftool/dbconfig/20220615-102929-root.json [10:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T310011)', diff saved to https://phabricator.wikimedia.org/P29778 and previous config saved to /var/cache/conftool/dbconfig/20220615-103048-marostegui.json [10:30:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1106.eqiad.wmnet with reason: Maintenance [10:30:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1106.eqiad.wmnet with reason: Maintenance [10:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:30:53] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [10:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:30:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T310011)', diff saved to https://phabricator.wikimedia.org/P29779 and previous config saved to /var/cache/conftool/dbconfig/20220615-103101-marostegui.json [10:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:07] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:34:17] (03PS3) 10Filippo Giunchedi: am: retry on CGI failure or empty output [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/805384 (https://phabricator.wikimedia.org/T310331) [10:36:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29780 and previous config saved to /var/cache/conftool/dbconfig/20220615-103608-root.json [10:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29781 and previous config saved to /var/cache/conftool/dbconfig/20220615-103615-root.json [10:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:17] (03CR) 10Filippo Giunchedi: "Thank you for the reviews David!" 
[debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/805384 (https://phabricator.wikimedia.org/T310331) (owner: 10Filippo Giunchedi) [10:39:22] (03CR) 10JMeybohm: Make SREBatchBase operate on host groups (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm) [10:39:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29782 and previous config saved to /var/cache/conftool/dbconfig/20220615-103933-root.json [10:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:51] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:42:05] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:42:37] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, and 2 others: Create a spicerack cookbook for restoring an etcd cluster from backups - https://phabricator.wikimedia.org/T203944 (10joanna_borun) [10:42:41] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10User-Joe: Covert deploy_apache_change.sh to a spicerack cookbook - https://phabricator.wikimedia.org/T203948 (10joanna_borun) [10:43:49] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, and 4 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10joanna_borun) [10:43:57] (03PS2) 10Aklapper: Redirect dev.wikimedia.org to developer.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/791321 (https://phabricator.wikimedia.org/T265018) [10:44:01] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10User-Joe: Create a spicerack cookbook to empty a ganeti node from VMs - 
https://phabricator.wikimedia.org/T203964 (10joanna_borun) [10:44:33] (03CR) 10Jbond: Make SREBatchBase operate on host groups (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm) [10:45:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:54] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, and 4 others: Create a cookbook to copy data between WDQS servers - https://phabricator.wikimedia.org/T213401 (10joanna_borun) [10:46:01] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 2 others: Create Spicerack cookbook to drain/reboot/uncordon a Kubernetes worker - https://phabricator.wikimedia.org/T212866 (10joanna_borun) [10:46:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:46:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:46:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:40] 10SRE, 10SRE-tools, 10Discovery-Search, 10Spicerack, and 2 others: Migrate elasticsearch scripts to spicerack cookbooks - https://phabricator.wikimedia.org/T202885 (10joanna_borun) [10:46:58] 10SRE, 10SRE-tools, 10Elasticsearch, 10Spicerack, and 2 others: Refactor current code base to support multiple elasticsearch instances/multiple elasticsearch clusters - https://phabricator.wikimedia.org/T207918 (10joanna_borun) [10:47:02] (03CR) 10Aklapper: [C: 03+1] "Please merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/791321 (https://phabricator.wikimedia.org/T265018) (owner: 10Aklapper) [10:47:22] 10SRE, 10SRE-tools, 10Spicerack, 10Discovery-Search (Current work), 10Patch-For-Review: Write cookbooks to support spicerack's elasticsearch multi cluster/instance - https://phabricator.wikimedia.org/T207919 (10joanna_borun) [10:47:35] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Maps, and 3 others: Create cookbook for postgres initialization on maps cluster - https://phabricator.wikimedia.org/T220946 (10joanna_borun) [10:47:39] 10SRE, 10SRE-tools, 10Discovery-Search, 10Spicerack: Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507 (10joanna_borun) [10:47:53] 10SRE, 10SRE-tools, 10Elasticsearch, 10Spicerack, and 2 others: Test spicerack elasticsearch module - https://phabricator.wikimedia.org/T207920 (10joanna_borun) [10:47:59] 10SRE, 10SRE-tools, 10Spicerack, 10Wikidata, and 2 others: Create Cookbook to restart WDQS - https://phabricator.wikimedia.org/T221832 (10joanna_borun) [10:48:15] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10User-Joe: Create cookbook to do `nodetool repair` across cassandra cluster - https://phabricator.wikimedia.org/T225694 (10joanna_borun) [10:48:41] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:48:44] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, and 3 others: Create WDQS reboot cookbook - https://phabricator.wikimedia.org/T224385 (10joanna_borun) [10:48:49] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Maps, and 3 others: Create cookbook to reboot Maps - https://phabricator.wikimedia.org/T224072 (10joanna_borun) [10:48:56] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Create a cookbook to restart the jvms on a Cassandra cluster - 
https://phabricator.wikimedia.org/T230022 (10joanna_borun) [10:49:02] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: More structured cookbooks to reboot hosts - https://phabricator.wikimedia.org/T252807 (10joanna_borun) [10:49:30] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 4 others: Create a cookbook to perform a rolling reboot of a kubernetes cluster - https://phabricator.wikimedia.org/T260661 (10joanna_borun) [10:49:50] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 3 others: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (10joanna_borun) [10:49:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29783 and previous config saved to /var/cache/conftool/dbconfig/20220615-105112-root.json [10:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29784 and previous config saved to /var/cache/conftool/dbconfig/20220615-105119-root.json [10:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:52] (03PS1) 10Btullis: Enable CAS-SSO for hue.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/805773 (https://phabricator.wikimedia.org/T310686) [10:54:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29786 and previous config saved to /var/cache/conftool/dbconfig/20220615-105437-root.json [10:54:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:53] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/805773 (https://phabricator.wikimedia.org/T310686) (owner: 10Btullis) [10:54:57] !log dbmaint es1@eqiad T310485 [10:54:59] !log dbmaint es2@eqiad T310485 [10:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:01] !log dbmaint es3@eqiad T310485 [10:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:05] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/805773 (https://phabricator.wikimedia.org/T310686) (owner: 10Btullis) [10:58:35] (03CR) 10Btullis: [C: 03+2] Enable CAS-SSO for hue.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/805773 (https://phabricator.wikimedia.org/T310686) (owner: 10Btullis) [11:01:07] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:09] 10SRE, 10LDAP-Access-Requests: Grant Access to Superset and Tunilo for Caroline Myrick - https://phabricator.wikimedia.org/T310524 (10SLyngshede-WMF) p:05Triage→03High @CMyrick-WMF I have added you to the WMF LDAP group, that should grant you access to Superset and Turnilo. It does not grant you access t... 
[11:01:27] 10SRE, 10LDAP-Access-Requests: Grant Access to Superset and Tunilo for Caroline Myrick - https://phabricator.wikimedia.org/T310524 (10SLyngshede-WMF) 05Open→03Resolved a:03SLyngshede-WMF [11:04:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T310011)', diff saved to https://phabricator.wikimedia.org/P29787 and previous config saved to /var/cache/conftool/dbconfig/20220615-110435-marostegui.json [11:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:40] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [11:06:13] (03CR) 10D3r1ck01: Use a service locator to get a job runner (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793837 (owner: 10D3r1ck01) [11:06:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29788 and previous config saved to /var/cache/conftool/dbconfig/20220615-110616-root.json [11:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29789 and previous config saved to /var/cache/conftool/dbconfig/20220615-110623-root.json [11:06:25] (03Abandoned) 10D3r1ck01: Use a service locator to get a job runner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/793837 (owner: 10D3r1ck01) [11:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:47] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:09:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 25%: After 
upgrade', diff saved to https://phabricator.wikimedia.org/P29790 and previous config saved to /var/cache/conftool/dbconfig/20220615-110940-root.json [11:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:05] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:19:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P29791 and previous config saved to /var/cache/conftool/dbconfig/20220615-111940-marostegui.json [11:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:49] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:21:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29792 and previous config saved to /var/cache/conftool/dbconfig/20220615-112119-root.json [11:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29793 and previous config saved to /var/cache/conftool/dbconfig/20220615-112127-root.json [11:21:30] (03CR) 10Ayounsi: [C: 03+2] Netbox: remove Icinga checks for netbox.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/805740 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [11:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:21] (03CR) 10D3r1ck01: "We discussed about this yesterday in PET technical discussion and we agreed to make a patch removing this file." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/805775 (https://phabricator.wikimedia.org/T175146) (owner: 10D3r1ck01) [11:23:06] (03CR) 10Muehlenhoff: [C: 03+2] "Looks good and PCC checks out as well, merging: https://puppet-compiler.wmflabs.org/pcc-worker1003/35854/" [puppet] - 10https://gerrit.wikimedia.org/r/805736 (https://phabricator.wikimedia.org/T303559) (owner: 10Jaime Nuche) [11:23:19] (03PS1) 10Jbond: redfish: update poll task to deal with older models [software/spicerack] - 10https://gerrit.wikimedia.org/r/805782 [11:24:18] XioNoX: shall I merge your Icinga patch along? [11:24:28] moritzm: yeah I was about to [11:24:31] thanks [11:24:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29794 and previous config saved to /var/cache/conftool/dbconfig/20220615-112444-root.json [11:24:46] ack, doing that now [11:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:56] merged [11:25:33] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:25:51] (03PS2) 10Hnowlan: P:maps::osm_replica fix prom-replication lag script. [puppet] - 10https://gerrit.wikimedia.org/r/805381 (owner: 10Slyngshede) [11:26:38] thx, checking [11:26:47] (03CR) 10Slyngshede: [C: 03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/805381 (owner: 10Slyngshede) [11:27:00] (03CR) 10CI reject: [V: 04-1] P:maps::osm_replica fix prom-replication lag script. [puppet] - 10https://gerrit.wikimedia.org/r/805381 (owner: 10Slyngshede) [11:28:36] (03PS3) 10Hnowlan: P:maps::osm_replica fix prom-replication lag script. 
[puppet] - 10https://gerrit.wikimedia.org/r/805381 (owner: 10Slyngshede) [11:29:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T302659)', diff saved to https://phabricator.wikimedia.org/P29795 and previous config saved to /var/cache/conftool/dbconfig/20220615-112924-marostegui.json [11:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:30] T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659 [11:31:13] (03PS1) 10Majavah: wmcs: neutron: use min_over_time [alerts] - 10https://gerrit.wikimedia.org/r/805783 [11:34:24] (03PS1) 10Slyngshede: Grant access to analytics_privatedata_users to user ricby [puppet] - 10https://gerrit.wikimedia.org/r/805785 (https://phabricator.wikimedia.org/T310227) [11:34:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P29796 and previous config saved to /var/cache/conftool/dbconfig/20220615-113445-marostegui.json [11:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29797 and previous config saved to /var/cache/conftool/dbconfig/20220615-113623-root.json [11:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29798 and previous config saved to /var/cache/conftool/dbconfig/20220615-113631-root.json [11:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:22] (03PS4) 10Hnowlan: P:maps::osm_replica fix prom-replication lag script. 
[puppet] - 10https://gerrit.wikimedia.org/r/805381 (owner: 10Slyngshede) [11:38:24] (03CR) 10CI reject: [V: 04-1] redfish: update poll task to deal with older models [software/spicerack] - 10https://gerrit.wikimedia.org/r/805782 (owner: 10Jbond) [11:39:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29799 and previous config saved to /var/cache/conftool/dbconfig/20220615-113948-root.json [11:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:16] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35859/console" [puppet] - 10https://gerrit.wikimedia.org/r/805381 (owner: 10Slyngshede) [11:41:09] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/805381 (owner: 10Slyngshede) [11:41:38] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35860/console" [puppet] - 10https://gerrit.wikimedia.org/r/804800 (owner: 10Legoktm) [11:42:52] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] P:maps::osm_replica fix prom-replication lag script. 
[puppet] - 10https://gerrit.wikimedia.org/r/805381 (owner: 10Slyngshede) [11:44:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P29800 and previous config saved to /var/cache/conftool/dbconfig/20220615-114430-marostegui.json [11:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:14] RECOVERY - Check systemd state on maps2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:45:48] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:40] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:22] RECOVERY - Check systemd state on maps1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:46] RECOVERY - Check systemd state on maps1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:36] RECOVERY - Check systemd state on maps1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:36] RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T310011)', diff saved to https://phabricator.wikimedia.org/P29801 and previous config saved to /var/cache/conftool/dbconfig/20220615-114950-marostegui.json [11:49:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1140.eqiad.wmnet with reason: 
Maintenance [11:49:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1140.eqiad.wmnet with reason: Maintenance [11:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:55] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [11:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1028 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29802 and previous config saved to /var/cache/conftool/dbconfig/20220615-115127-root.json [11:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1029 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29803 and previous config saved to /var/cache/conftool/dbconfig/20220615-115135-root.json [11:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:10] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:52:10] RECOVERY - Check systemd state on maps2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:52:28] RECOVERY - Check systemd state on maps1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:54:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29804 and previous config saved to 
/var/cache/conftool/dbconfig/20220615-115452-root.json [11:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:41] (03PS2) 10Jbond: redfish: update poll task to deal with older models [software/spicerack] - 10https://gerrit.wikimedia.org/r/805782 [11:57:49] (03PS3) 10Jbond: redfish: update poll task to deal with older models [software/spicerack] - 10https://gerrit.wikimedia.org/r/805782 [11:59:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P29805 and previous config saved to /var/cache/conftool/dbconfig/20220615-115935-marostegui.json [11:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:09] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/805785 (https://phabricator.wikimedia.org/T310227) (owner: 10Slyngshede) [12:00:41] (03CR) 10Slyngshede: [C: 03+2] Grant access to analytics_privatedata_users to user ricby [puppet] - 10https://gerrit.wikimedia.org/r/805785 (https://phabricator.wikimedia.org/T310227) (owner: 10Slyngshede) [12:00:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5001.eqsin.wmnet [12:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:11] 10SRE, 10LDAP-Access-Requests, 10Product-Analytics, 10Patch-For-Review: Requesting access to Superset for Ricardo Baeza-Yates - https://phabricator.wikimedia.org/T310227 (10SLyngshede-WMF) p:05Triage→03High @Leila I've added Ricardo to the analytics-privatedata-users users group, and the NDA group in L... 
[12:06:55] (03PS6) 10Majavah: sonofgridengine: grid_configurator: filter 'normal' stderr output [puppet] - 10https://gerrit.wikimedia.org/r/801770 (https://phabricator.wikimedia.org/T309525) [12:06:57] (03PS3) 10Majavah: sonofgridengine: grid_configurator: remove hostgroup and queue entries [puppet] - 10https://gerrit.wikimedia.org/r/801774 (https://phabricator.wikimedia.org/T309525) [12:06:59] (03PS3) 10Majavah: sonofgridengine: grid_configurator: remove hosts entries [puppet] - 10https://gerrit.wikimedia.org/r/801777 (https://phabricator.wikimedia.org/T309525) [12:07:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5001.eqsin.wmnet [12:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:52] * kart_ updating cxserver; no major changes. [12:10:31] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2022-06-15-074244-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/805726 (https://phabricator.wikimedia.org/T309266) (owner: 10KartikMistry) [12:14:12] (03Merged) 10jenkins-bot: Update cxserver to 2022-06-15-074244-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/805726 (https://phabricator.wikimedia.org/T309266) (owner: 10KartikMistry) [12:14:14] (03PS4) 10Slyngshede: Ganeti Prometheus exporter, initial checkin [debs/prometheus-ganeti-exporter] - 10https://gerrit.wikimedia.org/r/804276 [12:14:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T302659)', diff saved to https://phabricator.wikimedia.org/P29806 and previous config saved to /var/cache/conftool/dbconfig/20220615-121440-marostegui.json [12:14:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [12:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [12:14:44] T302659: Adjust the field type of localuser.lu_attached_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T302659 [12:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1135.eqiad.wmnet with reason: Maintenance [12:16:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1135.eqiad.wmnet with reason: Maintenance [12:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T310011)', diff saved to https://phabricator.wikimedia.org/P29807 and previous config saved to /var/cache/conftool/dbconfig/20220615-121620-marostegui.json [12:16:22] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [12:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:24] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [12:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:30] (03CR) 10Joal: [C: 03+1] "LGTM - Could either ottomata or btullis merge this please?" 
[puppet] - 10https://gerrit.wikimedia.org/r/802598 (https://phabricator.wikimedia.org/T309806) (owner: 10Milimetric) [12:16:32] Amir1: apergos: hihi, will be at the UTC early training tomorrow - I have nothing to deploy though :) [12:16:57] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [12:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:32] ok, thanks for the heads up! [12:18:34] (03CR) 10Btullis: [C: 03+2] Split up the tables we sqoop [puppet] - 10https://gerrit.wikimedia.org/r/802598 (https://phabricator.wikimedia.org/T309806) (owner: 10Milimetric) [12:19:17] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [12:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:40] (03PS8) 10JMeybohm: Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 [12:19:42] (03PS22) 10JMeybohm: Add a cookbook for rolling reboot of k8s clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) [12:19:59] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [12:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:41] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:21:02] (03CR) 10JMeybohm: Make SREBatchBase operate on host groups (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm) [12:21:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1032 es1033 es1034 for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P29808 and previous 
config saved to /var/cache/conftool/dbconfig/20220615-122123-root.json [12:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:36] (03CR) 10CI reject: [V: 04-1] Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 (owner: 10JMeybohm) [12:23:06] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [12:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:11] (03CR) 10CI reject: [V: 04-1] Add a cookbook for rolling reboot of k8s clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [12:23:24] TheresNoTime: I could offer https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/628773 as a no-op dead code cleanup if you want something to deploy :P [12:23:33] (the rest of the chain is blocked, but the first change should be okay) [12:23:50] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [12:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:13] Lucas_WMDE: works for me, thank you! 
but I'll defer to those doing the training ^^' [12:24:19] ^^ [12:25:39] !log Updated cxserver to 2022-06-15-074244-production (T309266, T310116, T309384, T306963) [12:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:48] T310116: Enable Section Translation in Uzbek Wikipedia - https://phabricator.wikimedia.org/T310116 [12:25:48] T309384: Enable Content and Section translation on wikipedias with new MT support from Flores - https://phabricator.wikimedia.org/T309384 [12:25:48] T306963: Integrate new section mapping database - https://phabricator.wikimedia.org/T306963 [12:25:49] T309266: Adjust default MT services for pairs where the default is not the most used - https://phabricator.wikimedia.org/T309266 [12:26:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5002.eqsin.wmnet [12:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:24] (03PS9) 10JMeybohm: Make SREBatchBase operate on host groups [cookbooks] - 10https://gerrit.wikimedia.org/r/802811 [12:30:26] (03PS23) 10JMeybohm: Add a cookbook for rolling reboot of k8s clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) [12:34:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5002.eqsin.wmnet [12:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:33] Hello! hello! 
we're going to start the Netbox upgrade now, please refrain from using it either directly or via cookbook (makevm, decom, provision, etc) if you have a doubt feel free to ask [12:38:03] godspeed XioNoX and others involved [12:39:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29810 and previous config saved to /var/cache/conftool/dbconfig/20220615-123938-root.json [12:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29811 and previous config saved to /var/cache/conftool/dbconfig/20220615-123943-root.json [12:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29812 and previous config saved to /var/cache/conftool/dbconfig/20220615-123949-root.json [12:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:16] (03PS1) 10Jbond: SREBaseClass: Allow overriding actions [cookbooks] - 10https://gerrit.wikimedia.org/r/805807 [12:42:31] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on netbox:443 with reason: Netbox upgrade to 3.2 T296452 [12:42:32] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 6:00:00 on netbox:443 with reason: Netbox upgrade to 3.2 T296452 [12:42:34] !log failover ganeti master in eqsin to ganeti5001 [12:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:37] T296452: Upgrade Netbox to 3.2 - https://phabricator.wikimedia.org/T296452 [12:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:49] (03CR) 10Slyngshede: [C: 03+1] Retire profile::logster_alarm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/805733 (owner: 10Muehlenhoff) [12:47:00] (03PS24) 10JMeybohm: Add a cookbook for rolling reboot of k8s clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) [12:47:12] (03CR) 10JMeybohm: Add a cookbook for rolling reboot of k8s clusters (037 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [12:48:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T310011)', diff saved to https://phabricator.wikimedia.org/P29813 and previous config saved to /var/cache/conftool/dbconfig/20220615-124810-marostegui.json [12:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:15] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [12:48:30] !log jbond@deploy1002 Started deploy [netbox/deploy@7bbf659]: log [12:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:00] PROBLEM - ganeti-wconfd running on ganeti5003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [12:51:42] !log jbond@deploy1002 Finished deploy [netbox/deploy@7bbf659]: log (duration: 03m 12s) [12:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:04] PROBLEM - Check unit status of netbox_ganeti_eqiad_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:54:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 10%: After upgrade', diff saved 
to https://phabricator.wikimedia.org/P29815 and previous config saved to /var/cache/conftool/dbconfig/20220615-125442-root.json [12:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29816 and previous config saved to /var/cache/conftool/dbconfig/20220615-125447-root.json [12:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29817 and previous config saved to /var/cache/conftool/dbconfig/20220615-125452-root.json [12:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:16] !log ayounsi@deploy1002 Started deploy [netbox/deploy@7bbf659]: deploying v2.11.12 [12:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:21] !log ayounsi@deploy1002 Finished deploy [netbox/deploy@7bbf659]: deploying v2.11.12 (duration: 00m 05s) [12:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:24] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_sync.service,netbox_ganeti_drmrs01_sync.service,netbox_ganeti_eqiad_sync.service,netbox_ganeti_eqsin_sync.service,netbox_ganeti_ulsfo_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:47] !log ayounsi@deploy1002 Started deploy [netbox/deploy@7bbf659]: deploying v2.11.12 [12:55:48] PROBLEM - Check unit status of netbox_ganeti_ulsfo_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_ulsfo_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:55:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:12] (03CR) 10David Caro: [C: 03+1] "LGTM, feel free to ignore the nits" [puppet] - 10https://gerrit.wikimedia.org/r/801770 (https://phabricator.wikimedia.org/T309525) (owner: 10Majavah) [12:56:45] !log ayounsi@deploy1002 Finished deploy [netbox/deploy@7bbf659]: deploying v2.11.12 (duration: 00m 58s) [12:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:03] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/801774 (https://phabricator.wikimedia.org/T309525) (owner: 10Majavah) [12:57:12] PROBLEM - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:57:39] (03CR) 10Ayounsi: [V: 03+1 C: 03+2] Netbox: Add 2.11 configuration knobs [puppet] - 10https://gerrit.wikimedia.org/r/790400 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [12:57:51] (03PS2) 10Ayounsi: Netbox: Add 2.11 configuration knobs [puppet] - 10https://gerrit.wikimedia.org/r/790400 (https://phabricator.wikimedia.org/T296452) [12:58:42] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (10MoritzMuehlenhoff) I also removed logsteralarms@ earlier the day, it's no longer needed. [12:59:33] (03CR) 10David Caro: [C: 03+1] "LGTM, feel free to ignore the nits" [puppet] - 10https://gerrit.wikimedia.org/r/801777 (https://phabricator.wikimedia.org/T309525) (owner: 10Majavah) [13:00:02] PROBLEM - Check unit status of netbox_ganeti_codfw_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? 
A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220615T1300). [13:00:05] awight and mainframe98: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:09] o/ [13:00:12] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on netbox1002.eqiad.wmnet with reason: Netbox upgrade to 3.2 [13:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:14] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on netbox1002.eqiad.wmnet with reason: Netbox upgrade to 3.2 [13:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:17] o/ [13:00:35] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on netbox2002.codfw.wmnet with reason: Netbox upgrade to 3.2 [13:00:37] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on netbox2002.codfw.wmnet with reason: Netbox upgrade to 3.2 [13:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:41] Lucas_WMDE: you were here earlier, do you plan to deploy? [13:00:42] I can deploy my patch, and happy to take care of mainframe98's if you wish? [13:00:56] awight: feel free to self-service [13:00:57] (03CR) 10Urbanecm: [C: 03+1] "LGTM. I think we can deploy this at any time." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805490 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [13:01:04] and if you also want to deploy mainframe98’s patch that’s fine by me :) [13:01:12] awight: I'd appreciate that, thank you [13:01:17] great! 
[13:01:39] * mainframe98 goes to scrounge up a test case on mediawiki.org to use as test [13:02:21] !log ayounsi@deploy1002 Started deploy [netbox/deploy@7bbf659]: deploying v3.1 [13:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:45] (03CR) 10Awight: [C: 03+2] "Backport deployment." [extensions/VisualEditor] (wmf/1.39.0-wmf.16) - 10https://gerrit.wikimedia.org/r/805745 (https://phabricator.wikimedia.org/T310602) (owner: 10Awight) [13:03:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P29818 and previous config saved to /var/cache/conftool/dbconfig/20220615-130315-marostegui.json [13:03:18] (03CR) 10Urbanecm: [C: 04-1] "actually, can you also add it to production's IS.php (with default => false)? it should work as-is, but adding to IS.php is a good practic" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805490 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [13:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:03] !log ayounsi@deploy1002 Finished deploy [netbox/deploy@7bbf659]: deploying v3.1 (duration: 01m 43s) [13:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:47] (03PS4) 10Jbond: Netbox: update config file for 3.1 [puppet] - 10https://gerrit.wikimedia.org/r/790681 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [13:05:41] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:05:57] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: systemd.timer not executing on cumin2001 after command was modified - https://phabricator.wikimedia.org/T268974 (10SLyngshede-WMF) 
05Open→03Resolved a:03SLyngshede-WMF [13:07:08] RECOVERY - Check unit status of netbox_ganeti_ulsfo_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_ulsfo_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:08:50] !log ayounsi@deploy1002 Started deploy [netbox/deploy@7bbf659]: deploying v3.1 [13:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29819 and previous config saved to /var/cache/conftool/dbconfig/20220615-130946-root.json [13:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29820 and previous config saved to /var/cache/conftool/dbconfig/20220615-130951-root.json [13:09:53] !log ayounsi@deploy1002 Finished deploy [netbox/deploy@7bbf659]: deploying v3.1 (duration: 01m 03s) [13:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29821 and previous config saved to /var/cache/conftool/dbconfig/20220615-130956-root.json [13:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:42] (03PS7) 10Majavah: sonofgridengine: grid_configurator: filter 'normal' stderr output [puppet] - 10https://gerrit.wikimedia.org/r/801770 (https://phabricator.wikimedia.org/T309525) [13:10:44] (03PS4) 10Majavah: sonofgridengine: grid_configurator: remove hostgroup and queue entries [puppet] - 10https://gerrit.wikimedia.org/r/801774 
(https://phabricator.wikimedia.org/T309525) [13:10:46] (03PS4) 10Majavah: sonofgridengine: grid_configurator: remove hosts entries [puppet] - 10https://gerrit.wikimedia.org/r/801777 (https://phabricator.wikimedia.org/T309525) [13:10:59] (03CR) 10Ayounsi: [C: 03+2] Netbox: update config file for 3.1 [puppet] - 10https://gerrit.wikimedia.org/r/790681 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [13:11:05] (03CR) 10Majavah: sonofgridengine: grid_configurator: filter 'normal' stderr output (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/801770 (https://phabricator.wikimedia.org/T309525) (owner: 10Majavah) [13:11:07] 10SRE, 10SRE-swift-storage, 10ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (10Cmjohnson) @MatthewVernon We do not have the licenses for Integrated Remote Console. As for a USB key attached, there is not anything, I turned off the internal USB, can you try agai... [13:11:21] (03CR) 10Majavah: sonofgridengine: grid_configurator: remove hosts entries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/801777 (https://phabricator.wikimedia.org/T309525) (owner: 10Majavah) [13:11:22] RECOVERY - Check unit status of netbox_ganeti_codfw_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:12:35] (03CR) 10David Caro: [C: 03+2] sonofgridengine: grid_configurator: remove hosts entries [puppet] - 10https://gerrit.wikimedia.org/r/801777 (https://phabricator.wikimedia.org/T309525) (owner: 10Majavah) [13:12:48] (03CR) 10David Caro: [C: 03+2] sonofgridengine: grid_configurator: remove hostgroup and queue entries [puppet] - 10https://gerrit.wikimedia.org/r/801774 (https://phabricator.wikimedia.org/T309525) (owner: 10Majavah) [13:13:16] (03PS2) 10Volans: netbox: 3.1 -> 3.2 add migration script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790636 (owner: 10Jbond) 
[13:13:26] (03CR) 10David Caro: [C: 03+2] sonofgridengine: grid_configurator: filter 'normal' stderr output [puppet] - 10https://gerrit.wikimedia.org/r/801770 (https://phabricator.wikimedia.org/T309525) (owner: 10Majavah) [13:14:13] (03CR) 10CI reject: [V: 04-1] netbox: 3.1 -> 3.2 add migration script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790636 (owner: 10Jbond) [13:15:40] RECOVERY - Check unit status of netbox_ganeti_eqiad_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:16:14] (03CR) 10Jbond: [V: 03+2 C: 03+2] netbox: 3.1 -> 3.2 add migration script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790636 (owner: 10Jbond) [13:16:33] (03PS6) 10Majavah: P:(toolforge|wmcs::paws)::prometheus: configure alertmanager endpoint [puppet] - 10https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716) [13:18:06] (03CR) 10Jbond: [C: 03+2] netbox: Add fixes for netbox 3.1 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790705 (owner: 10Jbond) [13:18:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P29822 and previous config saved to /var/cache/conftool/dbconfig/20220615-131820-marostegui.json [13:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:48] (03CR) 10CI reject: [V: 04-1] netbox: Add fixes for netbox 3.1 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790705 (owner: 10Jbond) [13:19:36] (03CR) 10CI reject: [V: 04-1] P:(toolforge|wmcs::paws)::prometheus: configure alertmanager endpoint [puppet] - 10https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [13:19:40] (03PS8) 10Jbond: netbox: Add fixes for netbox 3.1 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790705 [13:19:48] RECOVERY - Check 
unit status of netbox_ganeti_eqsin_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:20:21] (03CR) 10CI reject: [V: 04-1] netbox: Add fixes for netbox 3.1 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790705 (owner: 10Jbond) [13:20:23] (03PS7) 10Majavah: P:(toolforge|wmcs::paws)::prometheus: configure alertmanager endpoint [puppet] - 10https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716) [13:20:48] mainframe98: I'm still holding for CI, but wanted to ask you if your Translate patch is something that can be tested on test.wikipedia.org once it's merged? [13:21:07] (03PS8) 10Majavah: P:(toolforge|wmcs::paws)::prometheus: configure alertmanager endpoint [puppet] - 10https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716) [13:21:22] (03CR) 10Jbond: [V: 03+2] netbox: Add fixes for netbox 3.1 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790705 (owner: 10Jbond) [13:21:24] (03Merged) 10jenkins-bot: Restore internal mechanism to use either back or close button [extensions/VisualEditor] (wmf/1.39.0-wmf.16) - 10https://gerrit.wikimedia.org/r/805745 (https://phabricator.wikimedia.org/T310602) (owner: 10Awight) [13:21:24] mainframe98: I think this is as far as the train has proceeded for wmf.16 [13:21:37] awight: yes, but I don't have the required permissions. Is mediawiki.org an option? 
[13:22:05] Also, I'm not sure if that wiki has a translatable page that would error [13:22:09] mainframe98: Unfortunately not, it's at wmf.15: https://www.mediawiki.org/wiki/Special:Version [13:22:10] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [13:22:19] 10SRE, 10ops-eqiad, 10DC-Ops: Recycling Pickup for EQIAD - https://phabricator.wikimedia.org/T307140 (10Cmjohnson) Per my discussion with @wiki_willy we are keeping all juniper gear for future donations. [13:22:33] I'm okay with deploying blindly, if you can track versions and test once that becomes possible? [13:22:52] (This would be the same as if your patch had been merged before the wmf.16 branch.) [13:22:59] (03CR) 10Majavah: P:(toolforge|wmcs::paws)::prometheus: configure alertmanager endpoint (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [13:23:00] Sure, I use that feature daily to fight vandalism; if it breaks, I'll know [13:23:05] :-) ty! 
[13:24:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29823 and previous config saved to /var/cache/conftool/dbconfig/20220615-132450-root.json [13:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29824 and previous config saved to /var/cache/conftool/dbconfig/20220615-132454-root.json [13:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29825 and previous config saved to /var/cache/conftool/dbconfig/20220615-132500-root.json [13:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:13] (03PS2) 10Volans: ganeti-netbox-sync: Add netbox 3.2 support [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790991 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [13:25:51] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35862/console" [puppet] - 10https://gerrit.wikimedia.org/r/802104 (https://phabricator.wikimedia.org/T304716) (owner: 10Majavah) [13:26:15] (03CR) 10CI reject: [V: 04-1] ganeti-netbox-sync: Add netbox 3.2 support [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790991 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [13:26:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:52] Ooof scap crashed with "sync-file failed: 'Namespace' object has no attribute 'pause_after_testserver_sync'" [13:27:10] !log 
mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:27:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:43] !log ayounsi@deploy1002 Started deploy [netbox/deploy@7bbf659]: deploying v3.2 [13:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:46] This was mentioned at https://phabricator.wikimedia.org/P29785 but there's no task yet? [13:28:55] hnowlan: ^ did you find a workaround? [13:28:59] (03PS1) 10Volans: Removed temporary migration script to 3.2 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/805813 (https://phabricator.wikimedia.org/T296452) [13:29:49] !log ayounsi@deploy1002 Finished deploy [netbox/deploy@7bbf659]: deploying v3.2 (duration: 02m 06s) [13:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:02] !log ayounsi@deploy1002 Started deploy [netbox/deploy@7bbf659]: deploying v3.2 [13:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:46] (03CR) 10Eevans: [C: 03+1] Add new Cassandra cluster for ML cache/feature-store workloads in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232) (owner: 10Elukey) [13:30:50] (03CR) 10Volans: [C: 03+2] Removed temporary migration script to 3.2 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/805813 (https://phabricator.wikimedia.org/T296452) (owner: 10Volans) [13:31:11] !log ayounsi@deploy1002 Finished deploy [netbox/deploy@7bbf659]: deploying v3.2 (duration: 01m 08s) [13:31:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:54] (03Merged) 10jenkins-bot: Removed temporary migration script to 3.2 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/805813 (https://phabricator.wikimedia.org/T296452) (owner: 10Volans) [13:32:09] (03PS3) 10Volans: ganeti-netbox-sync: Add netbox 3.2 support [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790991 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [13:33:23] (03CR) 10Volans: [C: 03+2] ganeti-netbox-sync: Add netbox 3.2 support [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790991 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [13:33:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T310011)', diff saved to https://phabricator.wikimedia.org/P29826 and previous config saved to /var/cache/conftool/dbconfig/20220615-133326-marostegui.json [13:33:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1134.eqiad.wmnet with reason: Maintenance [13:33:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1134.eqiad.wmnet with reason: Maintenance [13:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:31] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [13:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T310011)', diff saved to https://phabricator.wikimedia.org/P29827 and previous config saved to /var/cache/conftool/dbconfig/20220615-133334-marostegui.json [13:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:03] (03Merged) 10jenkins-bot: 
ganeti-netbox-sync: Add netbox 3.2 support [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790991 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [13:34:22] (03PS13) 10Elukey: Add new Cassandra cluster for ML cache/feature-store workloads in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232) [13:35:47] (03CR) 10Elukey: [C: 03+2] Add new Cassandra cluster for ML cache/feature-store workloads in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/793714 (https://phabricator.wikimedia.org/T302232) (owner: 10Elukey) [13:36:11] (03PS2) 10Sergio Gimeno: MentorDashboard: enable the Vue version of the dashboard in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805490 (https://phabricator.wikimedia.org/T300532) [13:36:34] (03PS1) 10Ottomata: eventstreams - expose mediawiki.revision-tags-change [deployment-charts] - 10https://gerrit.wikimedia.org/r/805814 (https://phabricator.wikimedia.org/T294391) [13:36:53] (03CR) 10Sergio Gimeno: MentorDashboard: enable the Vue version of the dashboard in beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/805490 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [13:38:16] FYI, I'm deploying with --force to work around a scap bug. 
[13:38:17] !log awight@deploy1002 Synchronized php-1.39.0-wmf.16/extensions/VisualEditor/modules/ve-mw/ui/dialogs/ve.ui.MWTransclusionDialog.js: Backport: [[gerrit:805745|Restore internal mechanism to use either back or close button (T310602)]] (duration: 00m 37s) [13:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:20] T310602: Unexpected back button behavior in VisualEditor's citation dialog - https://phabricator.wikimedia.org/T310602 [13:38:43] (03CR) 10Ottomata: [C: 03+2] eventstreams - expose mediawiki.revision-tags-change [deployment-charts] - 10https://gerrit.wikimedia.org/r/805814 (https://phabricator.wikimedia.org/T294391) (owner: 10Ottomata) [13:38:45] (03CR) 10Awight: [C: 03+2] "Backport deployment." [extensions/Translate] (wmf/1.39.0-wmf.16) - 10https://gerrit.wikimedia.org/r/805749 (https://phabricator.wikimedia.org/T310440) (owner: 10Mainframe98) [13:38:47] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventstreams - expose mediawiki.revision-tags-change [deployment-charts] - 10https://gerrit.wikimedia.org/r/805814 (https://phabricator.wikimedia.org/T294391) (owner: 10Ottomata) [13:38:49] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [13:39:48] (03PS2) 10Filippo Giunchedi: icinga: check commons.w.o with blackbox exporter [puppet] - 10https://gerrit.wikimedia.org/r/804274 (https://phabricator.wikimedia.org/T305847) [13:39:50] (03PS1) 10Filippo Giunchedi: WIP irc check via blackbox [puppet] - 10https://gerrit.wikimedia.org/r/805815 [13:39:51] (03PS1) 10Filippo Giunchedi: prometheus: use hostname for blackbox::check::http [puppet] - 10https://gerrit.wikimedia.org/r/805816 (https://phabricator.wikimedia.org/T305847) [13:39:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 75%: After upgrade', diff 
saved to https://phabricator.wikimedia.org/P29828 and previous config saved to /var/cache/conftool/dbconfig/20220615-133954-root.json [13:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29829 and previous config saved to /var/cache/conftool/dbconfig/20220615-133958-root.json [13:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29830 and previous config saved to /var/cache/conftool/dbconfig/20220615-134004-root.json [13:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:10] awight: no unfortunately, filing a bug now [13:41:08] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply [13:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:11] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [13:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:53] RECOVERY - Confd template for /var/lib/gdnsd/discovery-netbox.state on authdns2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [13:43:58] 10SRE, 10SRE-swift-storage, 10ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (10MatthewVernon) @cmjohnson Ah, OK, I sort-of assumed we had HTML5 console everywhere. Now I know better :) I've just tried turning the system on again, and it's still finding the myste... 
[13:45:22] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply [13:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:25] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [13:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:36] (03PS2) 10Ayounsi: wmf-netbox: Netbox 3.2 compatibility [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/790975 (https://phabricator.wikimedia.org/T296452) [13:46:49] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/790975 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [13:47:07] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] wmf-netbox: Netbox 3.2 compatibility [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/790975 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [13:48:45] (03PS1) 10Majavah: sonofgridengine: grid_configurator: fix parameter name [puppet] - 10https://gerrit.wikimedia.org/r/805818 [13:49:25] !log ayounsi@cumin2002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: deploy new homer wmf-netbox - ayounsi@cumin2002 [13:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:02] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: deploy new homer wmf-netbox - ayounsi@cumin2002 [13:51:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:00] !log ayounsi@cumin2002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: deploy new homer wmf-netbox - ayounsi@cumin2002 [13:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:37] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) 
homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: deploy new homer wmf-netbox - ayounsi@cumin2002 [13:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1032 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29831 and previous config saved to /var/cache/conftool/dbconfig/20220615-135458-root.json [13:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1033 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29832 and previous config saved to /var/cache/conftool/dbconfig/20220615-135502-root.json [13:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1034 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P29833 and previous config saved to /var/cache/conftool/dbconfig/20220615-135508-root.json [13:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:25] (03PS1) 10Slyngshede: systemd::timer::job cleanup now absent cronjobs. 
[puppet] - 10https://gerrit.wikimedia.org/r/805820 (https://phabricator.wikimedia.org/T273673) [13:55:53] (03Merged) 10jenkins-bot: Fix deletion of translation pages outside of NS_MAIN namespace [extensions/Translate] (wmf/1.39.0-wmf.16) - 10https://gerrit.wikimedia.org/r/805749 (https://phabricator.wikimedia.org/T310440) (owner: 10Mainframe98) [13:56:11] PROBLEM - cassandra-a CQL 10.64.130.9:9042 on ml-cache1001 is CRITICAL: connect to address 10.64.130.9 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [13:57:31] !log awight@deploy1002 Synchronized php-1.39.0-wmf.16/extensions/Translate/src/PageTranslation/DeleteTranslatableBundleSpecialPage.php: Backport: [[gerrit:805749|Fix deletion of translation pages outside of NS_MAIN namespace (T310440)]] (duration: 00m 32s) [13:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:36] T310440: Attempting to delete a translation page using Special:PageTranslationDeletePage shows an internal error - https://phabricator.wikimedia.org/T310440 [13:57:43] mainframe98: Deployed, thanks for the patch! [13:58:08] !log EU afternoon backport window complete. [13:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:11] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:58:31] awight: Thank you. I'll followup with testing after the train progresses [13:58:41] Great! 
[13:59:45] 10SRE, 10Traffic: Let's Encrypt issuance chains update - https://phabricator.wikimedia.org/T283164 (10SLyngshede-WMF) [13:59:52] (03PS1) 10Volans: reports: puppetdb fix removed method clone() [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/805822 [14:00:19] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: OpenSSL < 1.1.0 compatibility issues with new LE issuance chain - https://phabricator.wikimedia.org/T283165 (10SLyngshede-WMF) 05Open→03Resolved a:03SLyngshede-WMF [14:00:46] (03PS1) 10Majavah: P:toolforge::redis_sentinel: stop sentinel and puppet fighting [puppet] - 10https://gerrit.wikimedia.org/r/805823 (https://phabricator.wikimedia.org/T309014) [14:01:09] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/805822 (owner: 10Volans) [14:01:13] (03CR) 10Volans: [C: 03+2] reports: puppetdb fix removed method clone() [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/805822 (owner: 10Volans) [14:01:38] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin1001" [14:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:44] !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "sync data - jbond@cumin1001" [14:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:11] (03Merged) 10jenkins-bot: reports: puppetdb fix removed method clone() [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/805822 (owner: 10Volans) [14:02:28] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35863/console" [puppet] - 10https://gerrit.wikimedia.org/r/805823 (https://phabricator.wikimedia.org/T309014) (owner: 10Majavah) [14:03:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply 
[14:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:03:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T310011)', diff saved to https://phabricator.wikimedia.org/P29834 and previous config saved to /var/cache/conftool/dbconfig/20220615-140505-marostegui.json [14:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:10] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [14:05:57] [URL redirect patch] Would someone please review/merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/791321 ? Thanks in advance! [14:06:01] (03PS2) 10Slyngshede: systemd::timer::job cleanup now absent cronjobs. 
[puppet] - 10https://gerrit.wikimedia.org/r/805820 (https://phabricator.wikimedia.org/T273673) [14:07:05] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/804572 (owner: 10Jbond) [14:07:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:22] (03PS3) 10Jbond: scap: update venv to use the system ca bundle [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/804572 [14:07:27] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/805820 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [14:07:33] (03CR) 10Jbond: [C: 03+2] scap: update venv to use the system ca bundle [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/804572 (owner: 10Jbond) [14:07:35] 10SRE, 10SRE-swift-storage, 10ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (10Cmjohnson) @MatthewVernon I disabled the internal SD drive and internal USB. I am hoping that works, I do not want to disable external USBs or I cannot use them on-site. Can you try n... 
[14:07:43] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:08:12] !log jnuche@deploy1002 Installing scap version "4.9.4" for 558 hosts [14:08:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5003.eqsin.wmnet [14:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:25] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin1001" [14:08:26] (03PS1) 10Btullis: Update the version of the datahub containers that is deployed [deployment-charts] - 10https://gerrit.wikimedia.org/r/805826 (https://phabricator.wikimedia.org/T310079) [14:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:31] !log jnuche@deploy1002 Installation of scap version "4.9.4" completed for 558 hosts [14:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:12] !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "sync data - jbond@cumin1001" [14:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:53] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin1001" [14:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:41] !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "sync data - jbond@cumin1001" [14:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:10] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - 
https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [14:12:10] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:15:13] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-cache1001.eqiad.wmnet with OS buster [14:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:38] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin1001" [14:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:25] !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "sync data - jbond@cumin1001" [14:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5003.eqsin.wmnet [14:17:10] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [14:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:57] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin1001" [14:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:43] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [14:19:44] !log jbond@cumin1001 END (FAIL) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "sync data - jbond@cumin1001" [14:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P29836 and previous config saved to /var/cache/conftool/dbconfig/20220615-142010-marostegui.json [14:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:57] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin1001" [14:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:19] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:22:44] !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "sync data - jbond@cumin1001" [14:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:56] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:53] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:25] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:27:24] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-cache1001.eqiad.wmnet with reason: host reimage [14:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:10] !log hnowlan@deploy1002 Synchronized private/PrivateSettings.php: T308670 credentials to access the similar-users service (duration: 03m 32s) [14:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:13] T308670: Configure SimilarEditors in production with Similarusers credentials - https://phabricator.wikimedia.org/T308670 [14:30:31] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-cache1001.eqiad.wmnet with reason: host reimage [14:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:00] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [14:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:48] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P29838 and previous config saved to /var/cache/conftool/dbconfig/20220615-143515-marostegui.json [14:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:58] [URL redirect patch] Would someone please review/merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/791321 ? Thanks in advance! [14:36:09] (^ @dcaro maybe? :) [14:38:48] hashar: Do you maybe know who could +2 https://gerrit.wikimedia.org/r/c/integration/docroot/+/791111 ? 
TIA :) [14:41:05] looking [14:42:46] andre: merged :) [14:43:20] andre: is there anything else needed for it to take effect? (aside from waiting for puppet to run) [14:43:37] dcaro: I hope not. :P Thank you a lot! [14:43:42] 👍 [14:46:00] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin1001" [14:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:30] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:46:47] !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "sync data - jbond@cumin1001" [14:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:37] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [14:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:06] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin1001" [14:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:34] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sync data - jbond@cumin1001" [14:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T310011)', diff saved to https://phabricator.wikimedia.org/P29839 and previous config saved to /var/cache/conftool/dbconfig/20220615-145020-marostegui.json [14:50:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1128.eqiad.wmnet with reason: Maintenance [14:50:24] !log 
marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1128.eqiad.wmnet with reason: Maintenance [14:50:25] !log ALTER-ing replication for codfw (Cassandra) expansion -- T307641 [14:50:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T310011)', diff saved to https://phabricator.wikimedia.org/P29840 and previous config saved to /var/cache/conftool/dbconfig/20220615-145028-marostegui.json [14:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:30] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [14:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:39] T307641: AQS multi-datacenter cluster expansion - https://phabricator.wikimedia.org/T307641 [14:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:05] !log jbond@cumin1001 START - Cookbook sre.pdus.uptime [14:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:14] !log jbond@cumin1001 END (ERROR) - Cookbook sre.pdus.uptime (exit_code=97) [14:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:28] !log jbond@cumin1001 START - Cookbook sre.pdus.rotate-password [14:53:28] !log jbond@cumin1001 END (FAIL) - Cookbook sre.pdus.rotate-password (exit_code=99) [14:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:36] !log jbond@cumin1001 START - Cookbook sre.pdus.rotate-password [14:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:44] !log 
jbond@cumin1001 END (PASS) - Cookbook sre.pdus.rotate-password (exit_code=0) [14:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:57] !log jbond@cumin1001 START - Cookbook sre.pdus.rotate-password [14:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:07] !log jbond@cumin1001 END (PASS) - Cookbook sre.pdus.rotate-password (exit_code=0) [14:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6003.drmrs.wmnet [14:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:21] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1059.eqiad.wmnet with reason: host reimage [14:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:58] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [14:58:09] jbond, volans and I are happy to announce that Netbox got successfully upgraded. You can now resume using it, as well as cookbooks. [14:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:55] !log jbond@cumin1001 START - Cookbook sre.hosts.ipmi-password-reset [14:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:18] !log jbond@cumin1001 Updating IPMI password on 1 hosts - jbond@cumin1001 [14:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:21] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [14:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:04] brennen, thcipriani, and mutante: That opportune time is upon us again. Time for a Phabricator update deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220615T1500). 
[15:00:29] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1059.eqiad.wmnet with reason: host reimage [15:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:38] o/ [15:00:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6003.drmrs.wmnet [15:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:34] let me know before any write happens to stop 2 of the replicas for phabricator, as an extra rollback protection, thcipriani [15:01:38] o/ [15:02:04] jynus: will do! Thank you! [15:02:28] that way, should the worst thing happen, we have immediate rollback, as if nothing had happened [15:02:42] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:09] here :) [15:03:24] !log phabricator maintenance about to start [15:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:05] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on phab1001.eqiad.wmnet with reason: maintenance [15:05:07] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab1001.eqiad.wmnet with reason: maintenance [15:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:26] is there a ticket (I know it won't be of much use :-) [15:05:29] ? 
[15:06:57] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply [15:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:30] silence submitted for phabricator in alertmanager [15:08:37] icinga downtime sent via cumin [15:08:48] PROBLEM - ganeti-wconfd running on ganeti6001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [15:09:16] mutante: FYI the downtime cookbook does downtime the host also on alertmanager, anything host-related [15:11:06] volans: icinga was phab1001.eqiad.wmnet, alertmanager was phabricator.wikimedia.org:443 [15:11:27] it has 2 IPs. so not so sure about that [15:11:57] ack, we need to improve the downtime cookbook to allow downtiming those too in AM, all the bits are already in spicerack [15:12:01] Can Not Connect to MySQL - should I stop things now? [15:12:51] jynus: not yet, we are not writing anything [15:12:55] ok [15:15:09] PROBLEM - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 5255 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Phabricator [15:15:48] some downtime was missing, I guess= [15:15:56] * Emperor here [15:16:10] maintenance, closing the incident [15:16:25] typical I get paged when downstairs making tea. 
[15:16:33] moritzm: thanks [15:16:34] thanks moritzm [15:16:55] RECOVERY - https://phabricator.wikimedia.org #page on phabricator.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 39622 bytes in 0.146 second response time https://wikitech.wikimedia.org/wiki/Phabricator [15:17:11] ^^ known, it is being upgraded [15:17:24] sorry, I tried to prevent exactly that [15:17:26] with the downtimes [15:17:45] fyi the page was generated from the new prometheus::blackbox::check::http (https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/phabricator/monitoring.pp#L50) it's possible the cookbook needs updating to take care of checks like this (cc godog volans ) [15:18:12] actually interesting ^ [15:18:22] should I file a ticket about this? [15:18:24] ni [15:18:26] no [15:18:33] those are two separate things [15:18:34] ok [15:18:43] the icinga alert is because there is a virtual host in icinga for phabricator.wikimedia.org [15:18:45] the downtime cookbook downtimes hosts as of now, a separate silence was added manually [15:18:53] and you can't send downtimes to virtual hosts, afaict [15:19:00] mutante: yes you can [15:19:03] oh sorry i missed it was an icinga alert ignore me [15:19:24] --force Override the check that use a Cumin query to validate the given hosts. Useful when you want to downtime a "host" that is not a real host like [15:19:27] a service or not anymore queryable via Cumin. [15:20:18] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on phabricator.wikimedia.org with reason: maintenace [15:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:20] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phabricator.wikimedia.org with reason: maintenace [15:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:24] thanks volans. done! 
[15:20:25] ==> Will downtime 1 unverified hosts: phabricator.wikimedia.org [15:20:30] with --force [15:20:37] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host ms-be1059.eqiad.wmnet with OS bullseye [15:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:49] jbond: i think both might be true [15:21:18] mutante: yes its possible the other one would have tiggered eventully yes [15:22:00] still, the duality of host vs service model, inherited from icinga may cause confusion for some time, I predict [15:23:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T310011)', diff saved to https://phabricator.wikimedia.org/P29841 and previous config saved to /var/cache/conftool/dbconfig/20220615-152315-marostegui.json [15:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:20] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [15:24:05] icinga has 2 hosts, phab1001.eqiad.wmnet with all the standard services on it AND phabricator.wikimedia.org as a virtual host [15:24:16] then there is alertmanager that has phabricator.wikimedia.org [15:24:26] and then in addition to both of these..there was what created the actual page [15:24:44] more important, volans: https://www.youtube.com/watch?v=zIV4poUZAQo [15:24:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6001.drmrs.wmnet [15:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:52] so 3 downtimes but still not the one that would have prevented the page, afaict [15:25:43] apache and phd are being stopped now [15:27:03] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:29:35] volans: currently we are replacing 
'puppet agent' commands in the maintenance script with enable-puppet/disable-puppet/.. heh [15:29:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6001.drmrs.wmnet [15:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:49] thanks! [15:33:30] jynus: we're ready to go ahead here if you want to stop replication [15:33:45] ok, logging it, and wait for my ok [15:33:55] jynus: ack, thanks [15:34:11] !log stopping replication for m3 on db1117, db2078 [15:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:50] brennen: confirmed replication stopped on selected hosts, you can continue [15:35:00] jynus: thanks, going ahead [15:35:24] this is the point in time we will quickly rollback in case of the worst [15:35:33] <3 [15:35:35] !log starting phabricator deploy, momentary downtime expected while Apache restarts and migrations run [15:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:50] thanks jynus, it's better this way, we appreciate it [15:36:01] Here for moral support seen as I filed the upgrade task [15:38:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to and previous config saved to /var/cache/conftool/dbconfig/20220615-153820-marostegui.json [15:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:17] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply [15:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:20] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [15:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:24] Phabricator down? 
[15:39:36] planned maintenance [15:39:40] thanks [15:40:22] !log phabricator upgrade in progress [15:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6004.drmrs.wmnet [15:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:59] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply [15:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:18] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [15:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:23] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [15:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:27] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [15:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:32] https://phabricator.wikimedia.org/source/mediawiki/ [15:51:46] Unable to Retrieve Paths [15:51:46] Command failed with error #1! 
COMMAND /usr/bin/sudo -E -n -u phd -- git ls-tree -z -l 68972c30d27f1b3a6e268cac0e64a7f78e8d3bb7 -- STDOUT (empty) STDERR sudo: a password is required [15:52:37] Probably related to the ongoing maintenance [15:52:48] Dylsss: thanks for reporting this [15:52:56] people are looking at it [15:53:05] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [15:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:13] 👍 [15:53:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P29843 and previous config saved to /var/cache/conftool/dbconfig/20220615-155325-marostegui.json [15:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6004.drmrs.wmnet [15:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:34] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [15:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:01] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [15:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:19] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [15:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:54] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [15:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:22] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [15:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:08] mutante: relay the "SECURITY information for phab1001.eqiad.wmnet" ? 
[15:59:37] jynus: where do you see that? [15:59:43] root mail [15:59:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [16:01:03] jynus: ah, ACK. thanks. I know why it happened. it's literally users trying to debug the error reported above [16:01:15] ah, ok [16:01:15] sudo commands being tested [16:01:32] did report it in case it needed ops attention [16:01:52] yes, thanks!:) [16:02:39] hey, the drop down to change task spaces seems gone now? [16:02:49] i dont have to do it often, but i happen to have one now and cannot edit and change space. [16:04:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [16:05:14] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-cache1001.eqiad.wmnet with OS buster [16:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:19] (ProbeDown) firing: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:05:58] ok [16:06:09] mutante: heyas, so i lost the ability to see task spaces or view rights on task since update? [16:06:19] someone report this already? [16:06:22] or am i first? [16:06:44] regarding, thanos-query, looking... [16:07:03] robh: you are first. there is still other ongoing stuff being debugged right now [16:07:10] is that the metrics frontend? 
[16:07:25] Ok, cool. Yeah so I used to be able to click 'edit task' and see both the space and the view/edit rights and now can see none of those things [16:07:29] which is problematic heh [16:07:49] i now seem to be prsented with the form for users with no advanced rights heh [16:08:17] so i can edit the basic task fields like assign, title, status, priority, description, tags, subscribers, and due date. just cannot see space, edit and view rights [16:08:28] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:30] just add to the list ; D [16:08:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T310011)', diff saved to https://phabricator.wikimedia.org/P29844 and previous config saved to /var/cache/conftool/dbconfig/20220615-160830-marostegui.json [16:08:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1118.eqiad.wmnet with reason: Maintenance [16:08:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1118.eqiad.wmnet with reason: Maintenance [16:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:35] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [16:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T310011)', diff saved to https://phabricator.wikimedia.org/P29845 and previous config saved to /var/cache/conftool/dbconfig/20220615-160838-marostegui.json [16:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing 
errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [16:10:18] (ProbeDown) resolved: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:11:40] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-presto1006.mgmt.eqiad.wmnet with reboot policy FORCED [16:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:07] robh: could you maybe create a ticket? there is a lot going on still [16:12:29] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-presto1007.mgmt.eqiad.wmnet with reboot policy FORCED [16:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:35] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:05] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-presto1008.mgmt.eqiad.wmnet with reboot policy FORCED [16:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:50] elastic doesn't seem to be 100% happy still: https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40&orgId=1&refresh=5m&from=1655298817054&to=1655309617054 [16:14:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [16:15:03] mutante: ok, doing now! 
[16:15:19] just phab tag or anything else ya think? [16:15:22] Dylsss: we think we got that problem, thanks for the report [16:16:00] robh: thank you! yea, just phab tag is good enough. people already start looking [16:17:36] cool, done and done, and yeah its not like a ubn [16:17:43] im sure there are ubn tasks pending ;D [16:20:41] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:21:34] !log pt1979@cumin1001 START - Cookbook sre.hosts.dhcp for host backup1009.eqiad.wmnet [16:21:35] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host backup1009.eqiad.wmnet [16:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:24:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:11] !log krinkle@deploy1002 Synchronized multiversion/: Id8cdb8aef70f6672 (duration: 03m 41s) [16:27:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:31] Dylsss: the issue you reported should be gone [16:27:42] robh: your issue will still be debugged [16:27:44] Yep, it is gone [16:27:48] great [16:29:13] PROBLEM - SSH on wtp1039.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:29:41] mutante: no doubt, i just didnt want to make it sound urgent cuz i talked in irc first is all [16:29:45] !log phabricator upgrade finished [16:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:01] jynus: i think we are ready to re-enable replication if you're still around [16:30:14] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-presto1006.mgmt.eqiad.wmnet with reboot policy FORCED [16:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:17] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-presto1008.mgmt.eqiad.wmnet with reboot policy FORCED [16:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:20] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-presto1007.mgmt.eqiad.wmnet with reboot policy FORCED [16:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:37] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-presto1009.mgmt.eqiad.wmnet with reboot policy FORCED [16:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:48] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-presto1010.mgmt.eqiad.wmnet with reboot policy FORCED [16:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:05] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host 
an-presto1011.mgmt.eqiad.wmnet with reboot policy FORCED [16:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:13] PROBLEM - cassandra-a CQL 10.64.130.9:9042 on ml-cache1001 is CRITICAL: connect to address 10.64.130.9 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [16:32:17] PROBLEM - cassandra-a service on ml-cache1001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:34:53] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:37:30] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-presto1012.mgmt.eqiad.wmnet with reboot policy FORCED [16:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:24] brennen: I just saw the ping [16:40:45] phab looking good? [16:41:17] jynus: yep, all good. [16:42:08] I will start the eqiad one leave the codfw stopped [16:42:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T310011)', diff saved to https://phabricator.wikimedia.org/P29847 and previous config saved to /var/cache/conftool/dbconfig/20220615-164222-marostegui.json [16:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:28] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [16:42:30] to still have a "we didn't realize some critical bug" or something [16:42:38] but to reenable eqiad redundancy [16:42:39] perfect [16:44:05] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:44:08] !log reestarting replication for m3 on db1117, not db2078 [16:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [16:49:56] jouncebot nowandnext [16:49:56] For the next 0 hour(s) and 10 minute(s): Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220615T1500) [16:49:56] In 1 hour(s) and 10 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220615T1800) [16:49:56] In 1 hour(s) and 10 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220615T1800) [16:53:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org/?q=alertname%3DLogstashIndexingFailures [16:54:48] !log train 1.39.0-wmf.16 (T308069): no current blockers - rolling to group0 [16:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:53] T308069: 1.39.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T308069 [16:57:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P29848 and previous config saved to /var/cache/conftool/dbconfig/20220615-165727-marostegui.json [16:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:43] PROBLEM - Check systemd state on netbox1002 is 
CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:24] taavi: legoktm: Reedy: looks like Wikibugs died somehow :-\ I have no idea how to restart it though but there is some doc at https://www.mediawiki.org/wiki/Wikibugs [17:03:29] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.16 refs T308069 [17:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:34] T308069: 1.39.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T308069 [17:05:41] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:06:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:07:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:00] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-presto1011.mgmt.eqiad.wmnet with reboot policy FORCED [17:10:03] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:15] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-presto1009.mgmt.eqiad.wmnet with reboot policy FORCED [17:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:30] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-presto1010.mgmt.eqiad.wmnet with reboot policy FORCED [17:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:45] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-presto1012.mgmt.eqiad.wmnet with reboot policy FORCED [17:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:42] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-presto1013.mgmt.eqiad.wmnet with reboot policy FORCED [17:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:05] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-presto1014.mgmt.eqiad.wmnet with reboot policy FORCED [17:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P29849 and previous config saved to /var/cache/conftool/dbconfig/20220615-171233-marostegui.json [17:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:30] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host an-presto1015.mgmt.eqiad.wmnet with reboot policy FORCED [17:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:51] things seem stable at group0, taking a break before regular train window. 
[17:27:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T310011)', diff saved to https://phabricator.wikimedia.org/P29851 and previous config saved to /var/cache/conftool/dbconfig/20220615-172738-marostegui.json [17:27:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1133.eqiad.wmnet with reason: Maintenance [17:27:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1133.eqiad.wmnet with reason: Maintenance [17:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:44] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [17:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:33:08] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:40] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-presto1015.mgmt.eqiad.wmnet with reboot policy FORCED [17:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:44] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-presto1014.mgmt.eqiad.wmnet with reboot policy FORCED [17:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:46] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-presto1013.mgmt.eqiad.wmnet with reboot policy FORCED [17:36:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:39:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:48] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host stat1010.mgmt.eqiad.wmnet with reboot policy FORCED [17:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:01] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:41:33] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1014.eqiad.wmnet with OS buster [17:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:45] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 1 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [17:52:13] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1014.eqiad.wmnet with OS buster [17:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2103.codfw.wmnet with reason: Maintenance [17:54:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2103.codfw.wmnet with reason: 
Maintenance [17:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 14 hosts with reason: Maintenance [17:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 14 hosts with reason: Maintenance [17:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:31] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1015.eqiad.wmnet with OS buster [17:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:30] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host stat1010.mgmt.eqiad.wmnet with reboot policy FORCED [17:58:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:46] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1015.eqiad.wmnet with OS buster [17:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:05] brennen and jeena: #bothumor I ♥ Unicode. All rise for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220615T1800). [18:00:05] brennen and jeena: Your horoscope predicts another unfortunate MediaWiki train - Utc-7 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220615T1800).
[18:00:41] o/ [18:04:55] o/ [18:06:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:58] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.16 refs T308069 [18:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:05] T308069: 1.39.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T308069 [18:07:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:07:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:41] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:10:42] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.16 refs T308069 (duration: 03m 43s) [18:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:15] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:12:10] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [18:13:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mwdebug: apply [18:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:27] brennen: something weird with db since 15:55 [18:14:49] I think it is only codfw [18:15:04] maybe there is maintenance [18:15:58] yeah, I think there is s1 maintenance on codfw, probably ignorable [18:16:31] sorry, I thought it was deployment-related [18:17:30] Thanks for checking on it jynus [18:18:14] I checked after the kafka thing, but it could be something else (there is not a lot of logs created) [18:19:11] jynus: ack, thx [18:19:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:19:55] (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [18:19:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:30] ^this increase is 300% so something else must be going on [18:20:44] (it is not the db thing) [18:21:16] hmm - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&viewPanel=2&refresh=5m doesn't seem to correlate with deploys particularly [18:21:33] aqs_cassandra, I think? 
[18:21:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1169.eqiad.wmnet with reason: Maintenance [18:21:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1169.eqiad.wmnet with reason: Maintenance [18:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T310011)', diff saved to https://phabricator.wikimedia.org/P29853 and previous config saved to /var/cache/conftool/dbconfig/20220615-182140-marostegui.json [18:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:45] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [18:21:53] should we notify data engineering? [18:23:10] awight: post train checkup report: backport worked. Thanks again! [18:23:31] https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&refresh=5m&var-datasource=eqiad+prometheus%2Fops&var-input=kafka%2Fclienterror-eqiad&viewPanel=39&from=1655306604837&to=1655317404837 [18:24:41] even if it is service specific, I am worried it could impact logging for other services [18:24:55] (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [18:26:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:10] re: notifying data engineering - yes? i'm over my head here. [18:28:28] aqs is them right? 
[18:28:35] I am not 100% sure [18:29:32] based on https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS i think so? [18:29:46] does anyone know someone that would be up right now? [18:30:00] e.g. in americas timezone? [18:31:02] milimetric and ottomata appear to be in US tz [18:31:31] I guess that is ping enough :-) [18:31:35] hello [18:31:47] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:32:27] reading backscroll but still not sure what is up? [18:32:31] ottomata: https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&refresh=5m AQS is creating 300% more logs than all other infra since a few minutes ago [18:32:40] brennen: the AQS cluster in codfw is still being set up [18:33:00] cc btullis [18:33:08] https://phabricator.wikimedia.org/T309808 [18:33:59] https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&refresh=5m&viewPanel=39&from=1655307212993&to=1655318012993 looks scary, may be delaying the global logging infra [18:35:09] Argh, I'm away from keyboard at the moment. I think that this is related to work that urandom has been doing on the new aqs cluster in codfw. [18:35:16] (maybe not, but looks worrying anyway) [18:35:45] if someone can check other production logs are not impacted, it can wait [18:36:01] (and no production aqs impact) [18:36:31] the aqs service was deployed along with Cassandra in codfw, and for some reason it's not happy [18:37:00] lmata and cwhite notified us about the logspam a while ago, but then it subsided. Looks like it's back to noisy again. [18:37:26] the log messages are its attempt to tell us why, but I don't know what "connection" means (that's the entirety of the message) [18:37:57] I just restarted it on one node, maybe that caused a spike? [18:38:28] Can we just stop rsyslog on aqsw* to stop the flow of messages into Logstash? [18:38:45] it started at 18:05 [18:39:01] Sorry, typing on phone.
Rad was supposed to be aqs2* [18:40:56] I rolled out some filters about 30min ago to drop the overly verbose logs from Kafka. Will take a while to burn through the backlog. [18:42:10] I may have worried more than needed- I just checked and logs for other services seem to be fresh, so no impact there, AFAICS [18:42:28] Maybe the aqs service isn't happy if it's trying to contact druid in eqiad. Just a thought. [18:42:37] !log wikibugs (irc bot for Phabricator/Gerrit) is no more working and would need a restart T310734 [18:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:43] T310734: Wikibugs no more sends Gerrit/Phabricator announcements to IRC 2022-06-15 - https://phabricator.wikimedia.org/T310734 [18:43:44] Thanks for the updates all. [18:44:37] btullis: no idea, these error messages of no help [18:57:02] 10SRE, 10SRE-OnFire, 10conftool, 10Sustainability (Incident Followup): Invalid confctl selector should either error out or select nothing - https://phabricator.wikimedia.org/T308100 (10Krinkle) [19:00:20] 10SRE, 10SRE-OnFire, 10conftool, 10Sustainability (Incident Followup): Invalid confctl selector should either error out or select nothing - https://phabricator.wikimedia.org/T308100 (10Krinkle) [19:01:12] 10SRE, 10DNS, 10WMF-Legal, 10serviceops, 10wikimediafoundation.org: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (10Varnent) As the #WMF-Legal project tag was added to this task, some general information to avoid wrong expecta... 
[19:01:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T310011)', diff saved to https://phabricator.wikimedia.org/P29854 and previous config saved to /var/cache/conftool/dbconfig/20220615-190140-marostegui.json [19:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:45] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [19:06:52] (03PS1) 10BCornwall: Traffic: Port IPsec/Strongswan connection alert [alerts] - 10https://gerrit.wikimedia.org/r/805887 (https://phabricator.wikimedia.org/T291946) [19:07:47] (03PS2) 10BCornwall: Traffic: Port IPsec/Strongswan connection alert [alerts] - 10https://gerrit.wikimedia.org/r/805887 (https://phabricator.wikimedia.org/T300723) [19:09:48] (03PS1) 10Ayounsi: Netbox: expose Netbox on the frontend's FQDN [puppet] - 10https://gerrit.wikimedia.org/r/805888 (https://phabricator.wikimedia.org/T243928) [19:09:50] (03PS1) 10Ayounsi: Prometheus: scrap Netbox django metrics [puppet] - 10https://gerrit.wikimedia.org/r/805889 (https://phabricator.wikimedia.org/T243928) [19:12:07] (03CR) 10Ayounsi: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/35879/" [puppet] - 10https://gerrit.wikimedia.org/r/805888 (https://phabricator.wikimedia.org/T243928) (owner: 10Ayounsi) [19:12:18] 10SRE, 10SRE-OnFire, 10conftool, 10Sustainability (Incident Followup): Invalid confctl selector should either error out or select nothing - https://phabricator.wikimedia.org/T308100 (10Krinkle) [19:13:15] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 262 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:16:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1169', diff saved to https://phabricator.wikimedia.org/P29855 and previous config saved to /var/cache/conftool/dbconfig/20220615-191645-marostegui.json [19:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:57] (03CR) 10Ayounsi: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1001/35880/prometheus1005.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/805889 (https://phabricator.wikimedia.org/T243928) (owner: 10Ayounsi) [19:19:41] (03PS4) 10Ssingh: bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) [19:20:42] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35881/console" [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [19:20:49] 10SRE, 10DNS, 10WMF-Legal, 10serviceops, 10wikimediafoundation.org: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (10Dzahn) just a note for serviceops: policy.wikimedia.org is not currently under the control of SRE/prod servers... [19:23:13] 10SRE-tools, 10Infrastructure-Foundations, 10netbox, 10Patch-For-Review: Complete Netbox prometheus scraping - https://phabricator.wikimedia.org/T243928 (10ayounsi) a:03ayounsi [19:23:35] 10SRE, 10Privacy Engineering, 10WMF-Legal, 10Privacy: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104 (10Dzahn) Looks like T310738 would make this obsolete. 
[19:28:54] 10SRE, 10DNS, 10WMF-Legal, 10serviceops, 10wikimediafoundation.org: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (10Dzahn) There are incoming redirects into policy.wikimedia.org: https://wikimedia.org/stopsurveillance -> http... [19:31:03] 10SRE, 10ops-eqiad, 10DC-Ops: Q4: rack/setup/install dse-k8s-worker100[5-8] - https://phabricator.wikimedia.org/T307400 (10Jclark-ctr) dse-k8s-worker1005 e1 U33 port 33 Cableid 20220052 dse-k8s-worker1006 e3 U33 port 33 Cableid 20220060 dse-k8s-worker1007 f1 U33 port 33 Cable... [19:31:06] 10SRE, 10ops-eqiad, 10DC-Ops: Q4: rack/setup/install dse-k8s-worker100[5-8] - https://phabricator.wikimedia.org/T307400 (10Jclark-ctr) [19:31:09] !log wikibugs IRC bot has been restarted by valhallasw \o/ # T310734 [19:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:14] T310734: Wikibugs no more sends Gerrit/Phabricator announcements to IRC 2022-06-15 - https://phabricator.wikimedia.org/T310734 [19:31:45] 10SRE, 10ops-eqiad, 10DC-Ops: Q4: rack/setup/install dse-k8s-worker100[5-8] - https://phabricator.wikimedia.org/T307400 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [19:31:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P29856 and previous config saved to /var/cache/conftool/dbconfig/20220615-193150-marostegui.json [19:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:18] (03CR) 10Ssingh: [V: 03+1] "PCC for dnsbox and centrallog hosts: https://puppet-compiler.wmflabs.org/pcc-worker1003/35882/" [puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [19:37:14] (03CR) 10Ssingh: [V: 03+1] "PCC looks OK for existing bird hosts but this change is not ready for review yet. DO NOT MERGE WIP!" 
[puppet] - 10https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: 10Ssingh) [19:40:15] PROBLEM - SSH on ms-be2041.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:43:25] (03PS1) 10Ayounsi: wmf-netbox: don't crash with "provider network" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/805898 (https://phabricator.wikimedia.org/T310591) [19:46:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T310011)', diff saved to https://phabricator.wikimedia.org/P29857 and previous config saved to /var/cache/conftool/dbconfig/20220615-194655-marostegui.json [19:46:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1132.eqiad.wmnet with reason: Maintenance [19:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1132.eqiad.wmnet with reason: Maintenance [19:47:00] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [19:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T310011)', diff saved to https://phabricator.wikimedia.org/P29858 and previous config saved to /var/cache/conftool/dbconfig/20220615-194703-marostegui.json [19:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:03] !log hashar@deploy1002 Started deploy [integration/docroot@b95391b]: Add Developer Portal - T302809 [19:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:08] T302809: Add dev portal to 
list of microsites on doc.wikimedia.org - https://phabricator.wikimedia.org/T302809 [19:50:14] !log hashar@deploy1002 Finished deploy [integration/docroot@b95391b]: Add Developer Portal - T302809 (duration: 00m 10s) [19:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:23] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:56:07] (CR) Hashar: [C: +2] wmf-config: Add audience to gdi-survey on cawiki beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/747549 (https://phabricator.wikimedia.org/T297623) (owner: Eigyan) [20:01:01] jouncebot: next [20:01:01] In 9 hour(s) and 58 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220616T0600) [20:01:03] (PS5) Ssingh: bird: upgrade configuration to bird2 (merge IPv4 and IPv6 configurations) [puppet] - https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) [20:01:28] jouncebot: now [20:01:28] For the next 0 hour(s) and 58 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220615T2000) [20:01:36] Huh not sure why the bot didn't announce it [20:01:45] (PS2) Catrope: Remove unused setting wgQuickSurveysUseVue [mediawiki-config] - https://gerrit.wikimedia.org/r/804014 (https://phabricator.wikimedia.org/T285890) [20:01:55] (CR) Catrope: [C: +2] Remove unused setting wgQuickSurveysUseVue [mediawiki-config] - https://gerrit.wikimedia.org/r/804014 (https://phabricator.wikimedia.org/T285890) (owner: Catrope) [20:02:13] (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35883/console" [puppet] - 
https://gerrit.wikimedia.org/r/805874 (https://phabricator.wikimedia.org/T310574) (owner: Ssingh) [20:02:52] (Merged) jenkins-bot: Remove unused setting wgQuickSurveysUseVue [mediawiki-config] - https://gerrit.wikimedia.org/r/804014 (https://phabricator.wikimedia.org/T285890) (owner: Catrope) [20:07:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:08:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:27] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:804014|Remove unused setting wgQuickSurveysUseVue (T285890)]] (duration: 03m 38s) [20:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:30] T285890: Remove OOUI surveys and default to Vue.js - https://phabricator.wikimedia.org/T285890 [20:09:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:24] (CR) BCornwall: "I'd love feedback on whether I should explore averaging the values so that flips between 1 and 2 are not ignored." 
[alerts] - https://gerrit.wikimedia.org/r/805887 (https://phabricator.wikimedia.org/T300723) (owner: BCornwall) [20:20:41] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:22:02] (PS1) Dzahn: cloud/devtools: fix hiera data for renamed gitlab-runner instance [puppet] - https://gerrit.wikimedia.org/r/805900 [20:22:35] (CR) Dzahn: [C: +2] cloud/devtools: fix hiera data for renamed gitlab-runner instance [puppet] - https://gerrit.wikimedia.org/r/805900 (owner: Dzahn) [20:25:44] ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T309741 (phaultfinder) [20:36:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCOldPoolFlatlined [20:41:25] RECOVERY - SSH on ms-be2041.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:47:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T310011)', diff saved to https://phabricator.wikimedia.org/P29859 and previous config saved to /var/cache/conftool/dbconfig/20220615-204717-marostegui.json [20:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:22] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [21:02:11] PROBLEM - SSH on 
aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:02:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P29860 and previous config saved to /var/cache/conftool/dbconfig/20220615-210223-marostegui.json [21:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:41] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:11:43] (PS3) MewOphaswongse: Structured task: enable free text for "other" rejection reason [mediawiki-config] - https://gerrit.wikimedia.org/r/805480 (https://phabricator.wikimedia.org/T304099) [21:17:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P29861 and previous config saved to /var/cache/conftool/dbconfig/20220615-211728-marostegui.json [21:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:36] (CR) Volans: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/805782 (owner: Jbond) [21:21:24] (CR) Jbond: [C: +2] redfish: update poll task to deal with older models [software/spicerack] - https://gerrit.wikimedia.org/r/805782 (owner: Jbond) [21:29:31] (Merged) jenkins-bot: redfish: update poll task to deal with older models [software/spicerack] - https://gerrit.wikimedia.org/r/805782 (owner: Jbond) [21:32:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T310011)', diff saved to https://phabricator.wikimedia.org/P29862 and previous config saved to 
/var/cache/conftool/dbconfig/20220615-213233-marostegui.json [21:32:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1184.eqiad.wmnet with reason: Maintenance [21:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1184.eqiad.wmnet with reason: Maintenance [21:32:37] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [21:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T310011)', diff saved to https://phabricator.wikimedia.org/P29863 and previous config saved to /var/cache/conftool/dbconfig/20220615-213241-marostegui.json [21:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:06] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (Cmjohnson) [21:46:48] SRE, ops-eqiad, DBA: db1173 won't boot up - https://phabricator.wikimedia.org/T310595 (wiki_willy) a:wiki_willy→Cmjohnson [21:49:40] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1014.eqiad.wmnet with OS buster [21:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:45] SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host wdqs1014.eqiad.wmnet with OS buster [22:02:03] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 
wdqs1014.eqiad.wmnet with reason: host reimage [22:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:07] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1015.eqiad.wmnet with OS buster [22:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:07] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1016.eqiad.wmnet with OS buster [22:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T310011)', diff saved to https://phabricator.wikimedia.org/P29864 and previous config saved to /var/cache/conftool/dbconfig/20220615-220329-marostegui.json [22:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:34] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [22:04:21] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:05:13] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1014.eqiad.wmnet with reason: host reimage [22:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:10] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [22:12:56] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1016.eqiad.wmnet with OS buster [22:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:00] 
SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host wdqs1016.eqiad.wmnet with OS buster executed w... [22:14:32] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1015.eqiad.wmnet with reason: host reimage [22:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:44] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1016.eqiad.wmnet with OS buster [22:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:49] SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host aqs1016.eqiad.wmnet with OS buster [22:17:00] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1016.eqiad.wmnet with OS buster [22:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:05] SRE, ops-eqiad, Cassandra, DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host aqs1016.eqiad.wmnet with OS buster executed with errors: - aqs1016... 
[22:17:37] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1015.eqiad.wmnet with reason: host reimage [22:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:56] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1014.eqiad.wmnet with OS buster [22:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:01] SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host wdqs1014.eqiad.wmnet with OS buster completed:... [22:18:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P29865 and previous config saved to /var/cache/conftool/dbconfig/20220615-221834-marostegui.json [22:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:22] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1015.eqiad.wmnet with OS buster [22:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:28] SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host wdqs1015.eqiad.wmnet with OS buster completed:... 
[22:33:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P29866 and previous config saved to /var/cache/conftool/dbconfig/20220615-223339-marostegui.json [22:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:00] SRE, ops-eqiad, DC-Ops, Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6] - https://phabricator.wikimedia.org/T307138 (Cmjohnson) 1014 and 1015 are installed, 1016 shows that no cables are connected. John will look at that in the morning. [22:46:12] RECOVERY - Check systemd state on an-tool1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:48:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T310011)', diff saved to https://phabricator.wikimedia.org/P29867 and previous config saved to /var/cache/conftool/dbconfig/20220615-224845-marostegui.json [22:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:50] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [23:39:11] (PS1) Cwhite: logstash: add test2 partition to ecs-test policy [puppet] - https://gerrit.wikimedia.org/r/805921 (https://phabricator.wikimedia.org/T301760)