[00:01:08] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [00:08:24] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/page/{sourcelanguage}/{targetlanguage}/{title} (Translate enwiki protected page) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [00:09:48] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:17:54] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [00:20:06] PROBLEM - WDQS high update lag on wdqs1012 is CRITICAL: 6.247e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [00:34:48] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [00:35:14] RECOVERY - SSH on kubernetes1014 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:42:24] PROBLEM - SSH on kubernetes1014 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:47:58] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:55:48] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds 
with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [01:12:24] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [01:21:40] RECOVERY - WDQS high update lag on wdqs1012 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 2.046e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [01:24:32] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [01:29:30] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [01:34:18] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/dictionary/{word}/{from}/{to}/{provider} (Fetch dictionary meaning without specifying a provider) timed out before a response was received: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) is CRITICAL: Test Suggest a source title to use for translation returned the unexpected status 504 (expecting: 200): /v2/suggest/sections/titles/{fr [01:34:18] (Suggest target section titles for given source sections) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/CX [01:41:28] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [02:04:41] (KubernetesRsyslogDown) firing: rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [02:11:22] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [02:21:56] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:25:34] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [02:32:30] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [02:39:30] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/dictionary/{word}/{from}/{to}/{provider} (Fetch dictionary meaning with a given provider) timed out before a response was received: /v2/page/{sourcelanguage}/{targetlanguage}/{title} (Translate enwiki protected page) timed out before a response was received: /v2/suggest/sections/titles/{from}/{to} (Suggest target section titles for given source sections) timed out b [02:39:30] response was received https://wikitech.wikimedia.org/wiki/CX [02:41:46] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/CX [02:56:20] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/page/{language}/{title} (Fetch enwiki protected page) timed out before a response was received: /v1/dictionary/{word}/{from}/{to} (Fetch dictionary meaning without specifying a provider) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [03:12:56] RECOVERY - SSH on kubernetes1014 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:20:04] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [03:20:10] PROBLEM - SSH on kubernetes1014 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:20:24] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [03:27:26] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/dictionary/{word}/{from}/{to}/{provider} (Fetch dictionary meaning with a given provider) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [03:46:34] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [03:53:54] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/dictionary/{word}/{from}/{to} (Fetch dictionary meaning without specifying a provider) timed out before a response was received: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received: /v2/suggest/sections/{title}/{from}/{to} (Suggest source section [03:53:54] nslate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [04:03:06] RECOVERY - SSH on kubernetes1014 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:10:22] PROBLEM - SSH on kubernetes1014 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:24:26] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:32:28] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [04:39:50] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/page/{language}/{title}/{revision} (Fetch enwiki protected page) timed out before a response was received: /v2/page/{sourcelanguage}/{targetlanguage}/{title} (Translate enwiki protected page) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [04:42:50] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [04:57:16] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [04:59:06] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [05:00:20] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended 
metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [05:02:08] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [05:03:44] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:04:32] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [05:06:30] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/page/{language}/{title}/{revision} (Fetch enwiki protected page) timed out before a response was received: /v1/dictionary/{word}/{from}/{to}/{provider} (Fetch dictionary meaning without specifying a provider) timed out before a response was received: /v2/page/{sourcelanguage}/{targetlanguage}/{title} (Translate enwiki protected page) timed out before a response was [05:06:30] https://wikitech.wikimedia.org/wiki/CX [05:18:16] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [05:19:20] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, mov [05:19:20] 
://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [05:25:14] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/page/{language}/{title} (Fetch enwiki protected page) timed out before a response was received: /v2/page/{sourcelanguage}/{targetlanguage}/{title} (Translate enwiki protected page) timed out before a response was received: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a resp [05:25:14] received: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [05:25:36] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [05:29:20] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [05:31:36] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [05:32:02] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [05:32:28] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [05:37:02] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [05:39:02] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/page/{sourcelanguage}/{targetlanguage}/{title} (Translate enwiki protected page) timed out before a response was received: /v2/page/{sourcelanguage}/{targetlanguage}/{title}/{revision} (Translate enwiki protected page) timed out before a response was received: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links [05:39:02] t language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [05:43:30] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [05:43:58] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [05:50:44] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/dictionary/{word}/{from}/{to} (Fetch dictionary meaning without specifying a provider) timed out before a response was received: /v2/page/{sourcelanguage}/{targetlanguage}/{title}/{revision} (Translate enwiki protected page) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [06:00:18] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [06:04:41] (KubernetesRsyslogDown) firing: rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [06:05:00] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:10:21] !log draining kubernetes1014 [06:10:24] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [06:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:15:17] This also fixed the Kafka rdf-streaming-updater, which is needed for the WDQS updater (which is lagging quite a bit by now, so that Wikidata editing got throttled). [06:19:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [06:20:04] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [06:20:45] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on kubernetes1014.eqiad.wmnet with reason: potential HW error [06:20:46] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on kubernetes1014.eqiad.wmnet with reason: potential HW error [06:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:02] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [06:31:30] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [06:36:20] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [06:41:08] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [06:45:58] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [07:05:14] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [07:08:52] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/list/{tool} (Get the MT tool between two language pairs) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [07:09:06] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:11:02] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [07:14:26] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [07:29:28] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [07:34:00] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [07:34:45] SRE-Access-Requests: Requesting access to Superset/Turnilo for Kinneretgordon - https://phabricator.wikimedia.org/T301098 (KinneretG) [07:41:10] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [07:47:58] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [07:52:32] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [08:00:42] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [08:00:54] (PS1) Muehlenhoff: Remove access for seve-kim [puppet] - https://gerrit.wikimedia.org/r/760518 [08:02:03] !log powercycle kubernetes1014 - T301099 [08:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:09] T301099: kubernetes1014 unreachable - https://phabricator.wikimedia.org/T301099 [08:04:04] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:04:14] that's me [08:04:16] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP 
CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:04:25] (PS2) Muehlenhoff: Remove access for seve-kim [puppet] - https://gerrit.wikimedia.org/r/760518 [08:05:18] RECOVERY - SSH on kubernetes1014 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:06:22] RECOVERY - kubelet operational latencies on kubernetes1014 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:06:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [08:09:26] (KubernetesRsyslogDown) resolved: rsyslog on kubernetes1014:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [08:11:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [08:15:01] (CR) Muehlenhoff: [C: +2] Remove access for seve-kim [puppet] - https://gerrit.wikimedia.org/r/760518 (owner: Muehlenhoff) [08:18:42] SRE, cloud-services-team: Investigate use of hp-asrd on HPE servers - https://phabricator.wikimedia.org/T221939 (MoritzMuehlenhoff) a:MoritzMuehlenhoff→None [08:18:48] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [08:26:04] PROBLEM - k8s API server 
requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [08:32:18] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [08:38:06] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [08:42:54] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [08:44:31] (CR) Jelto: [V: +1] gitlab_runner: execute gitlab-runner as non-root (2 comments) [puppet] - https://gerrit.wikimedia.org/r/759254 (https://phabricator.wikimedia.org/T295481) (owner: Jelto) [08:45:00] (PS1) Ema: varnish: remove X-Wikimedia-Security-Audit leftover [puppet] - https://gerrit.wikimedia.org/r/760520 (https://phabricator.wikimedia.org/T229320) [08:45:34] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:49:10] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [08:55:00] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:09:06] !log uncordoned kubernetes1014 - T301099 [09:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:11] T301099: kubernetes1014 unreachable - https://phabricator.wikimedia.org/T301099 [09:09:28] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:14:16] did anyone do something with the query service last night, specifically wdqs1012? [09:15:08] there’s nothing in SAL, but wdqs1012 had a huge lag spike shortly after midnight (UTC), and as a result automated Wikidata edits stopped for ca. 
7 hours [09:15:16] (at least I assume that’s the reason) [09:15:17] Lucas_WMDE: looking :/ [09:15:24] Lucas_WMDE: o/ wdqs is listed in icinga indeed [09:15:26] dcausse: --^ [09:15:32] ok thanks [09:20:33] (CR) Vgutierrez: [C: +1] varnish: remove X-Wikimedia-Security-Audit leftover [puppet] - https://gerrit.wikimedia.org/r/760520 (https://phabricator.wikimedia.org/T229320) (owner: Ema) [09:21:32] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:21:46] !log temp-disable mfa for 'filippo' - T296629 [09:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:50] T296629: Deprecation of U2F API in Chrome / Enable web auth in CAS - https://phabricator.wikimedia.org/T296629 [09:24:00] (CR) Ema: [C: +2] varnish: remove X-Wikimedia-Security-Audit leftover [puppet] - https://gerrit.wikimedia.org/r/760520 (https://phabricator.wikimedia.org/T229320) (owner: Ema) [09:28:13] (CR) Gehel: [C: -1] "See comment inline: we're missing a decom ticket" [puppet] - https://gerrit.wikimedia.org/r/759637 (https://phabricator.wikimedia.org/T294805) (owner: Ryan Kemper) [09:30:57] Lucas_WMDE: I believe this is related to the updater running in k8s which had troubles (from yesterday ~20h to this morning ~6am), wdqs1012 is just a symptom, I'll file a task to investigate the causes [09:31:07] ok thanks [09:31:10] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={LIST,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:35:02] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [09:36:00] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:36:11] (in the past issues like this were sometimes caused by a server being repooled before it had finished catching up, so I thought it might be something similar this time) [09:37:24] here I think it's just that the updater (which now feeds all eqiad instances) stopped running properly during that time, causing all eqiad wdqs servers to lag [09:37:28] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [09:38:10] we could have routed wdqs traffic to codfw to limit the impact tho [09:38:12] hm, strange that Grafana isn’t showing the same lag on the other servers then [09:40:52] hm I think I know why... 
wdqs1012 definitely had issues prior to this updater problem (wdqs1012 was stuck at yesterday at ~10am) [09:45:26] (CR) JMeybohm: [V: +1 C: +2] Add label node-role.kubernetes.io/master to masters [puppet] - https://gerrit.wikimedia.org/r/759741 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm) [09:46:36] and the wdqs lag graph in grafana is not reporting what the maxlag system is checking (fixing) [09:49:24] Lucas_WMDE: fixed the graph in grafana: https://grafana-rw.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m [09:50:36] thanks (though I’m not sure what changed ^^) [09:50:48] oh, I just saw your message above the ping [09:50:50] ok that makes sense [09:55:20] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={LIST,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:59:18] !log btullis@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=aqs1004.eqiad.wmnet [09:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:10] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:04:12] PROBLEM - Check systemd state on kubemaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:07:24] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:10:16] PROBLEM - Check systemd state on kubemaster2002 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:11:12] hmmm expected? 😅 [10:11:29] oh.. that's the stage one [10:14:04] PROBLEM - Check systemd state on kubemaster1002 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:14:40] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:15:12] PROBLEM - Check systemd state on kubemaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:20:20] PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:21:28] PROBLEM - Check systemd state on kubestagemaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:22:00] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:22:43] !log btullis@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=aqs1005.eqiad.wmnet [10:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:50] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:32:21] kubelets are my fault, rolling back [10:32:34] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:32:43] (03PS1) 10JMeybohm: Revert "Add label node-role.kubernetes.io/master to masters" [puppet] - 10https://gerrit.wikimedia.org/r/759962 [10:33:19] (03CR) 10jerkins-bot: [V: 04-1] Revert "Add label node-role.kubernetes.io/master to masters" [puppet] - 10https://gerrit.wikimedia.org/r/759962 (owner: 10JMeybohm) [10:34:20] 10SRE, 10SRE Observability: dropped packets to kafkamon 9000/tcp - https://phabricator.wikimedia.org/T238794 (10fgiunchedi) a:05fgiunchedi→03None [10:35:01] (03PS2) 10JMeybohm: Revert "Add label node-role.kubernetes.io/master to masters" [puppet] - 10https://gerrit.wikimedia.org/r/759962 [10:36:10] (03CR) 10JMeybohm: [C: 03+2] Revert "Add label node-role.kubernetes.io/master to masters" [puppet] - 10https://gerrit.wikimedia.org/r/759962 (owner: 10JMeybohm) [10:36:33] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/759751 (https://phabricator.wikimedia.org/T300853) (owner: 10Cwhite) [10:38:18] RECOVERY - Check systemd state on kubemaster1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:38:25] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/759757 (https://phabricator.wikimedia.org/T300853) (owner: 10Cwhite) [10:38:50] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: relabel 'instance' in job=prometheus with hostname [puppet] - 10https://gerrit.wikimedia.org/r/759517 (owner: 10Filippo Giunchedi) [10:39:24] RECOVERY - Check systemd state on kubemaster2002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:39:26] RECOVERY - Check systemd state on kubemaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:39:44] RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:40:36] RECOVERY - Check systemd state on kubemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:40:50] RECOVERY - Check systemd state on kubestagemaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:41:22] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:49:26] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=prometheus1004.eqiad.wmnet [10:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:37] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=prometheus2004.codfw.wmnet [10:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:06] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:51:59] !log rolling upgrade of varnish from version 6.0.9 to 6.0.10 across DCs T300264 [10:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:47] (03CR) 10Muehlenhoff: gitlab_runner: execute gitlab-runner as non-root (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/759254 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [11:00:31] !log btullis@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=aqs1006.eqiad.wmnet [11:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:55] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync on production [11:14:56] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync on staging [11:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:14] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync on production [11:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:00] 10SRE, 10Observability-Logging: Ingest webrequest sampled 1000 into logstash - https://phabricator.wikimedia.org/T301110 (10fgiunchedi) [11:16:46] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [11:18:14] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync on production [11:18:15] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync on staging [11:18:16] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:39] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync on production [11:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:02] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [11:27:26] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:29:50] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:35:42] (03PS1) 10Jelto: helmfiles: log SAL on presync and postsync [deployment-charts] - 10https://gerrit.wikimedia.org/r/760524 [11:36:43] (03PS2) 10Jelto: helmfiles: log SAL on presync and postsync [deployment-charts] - 10https://gerrit.wikimedia.org/r/760524 [11:40:58] !log btullis@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=aqs1007.eqiad.wmnet [11:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:53] (03CR) 10Jgiannelos: [C: 03+1] tegola: prepare eqiad cluster for production [deployment-charts] - 10https://gerrit.wikimedia.org/r/759938 (owner: 10MSantos) [11:45:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] kubeadm: node-upgrade: ignore empty lines in host list [puppet] - 10https://gerrit.wikimedia.org/r/759776 (owner: 10Majavah) [11:53:48] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:56:02] 10SRE, 10SRE-Access-Requests: saisuman ssh production public keys reused for WMCS - https://phabricator.wikimedia.org/T300708 (10jcrespo) [11:59:16] !log btullis@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=aqs1008.eqiad.wmnet [11:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:44] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [12:00:46] hi [12:01:05] no B&C today? [12:01:32] jouncebot: now [12:01:32] For the next 0 hour(s) and 58 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220207T1200) [12:01:45] hmm [12:02:34] jouncebot: now [12:02:34] hmm [12:02:34] For the next 0 hour(s) and 57 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220207T1200) [12:02:45] I can deploy today [12:02:54] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:03:04] matthiasmullie: hi, around_ [12:03:05] ? 
[12:03:08] o/ [12:03:48] (03PS3) 10Matthias Mullie: Stop capturing media change tags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756014 (https://phabricator.wikimedia.org/T286362) [12:04:29] (03CR) 10Majavah: [C: 03+2] Stop capturing media change tags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756014 (https://phabricator.wikimedia.org/T286362) (owner: 10Matthias Mullie) [12:05:14] (03Merged) 10jenkins-bot: Stop capturing media change tags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756014 (https://phabricator.wikimedia.org/T286362) (owner: 10Matthias Mullie) [12:05:34] matthiasmullie: your patch is live on mwdebug1001, please test [12:05:39] checking [12:06:26] taavi: lgtm [12:06:34] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd1004.eqiad.wmnet with reason: Switch to plain disk storage [12:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd1004.eqiad.wmnet with reason: Switch to plain disk storage [12:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:42] !log revert kubestagetcd1004 to plain disk storage [12:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:09] !log taavi@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:756014|Stop capturing media change tags (T286362)]] (1/2) (duration: 00m 50s) [12:08:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:15] T286362: [XL] mw-add-media and mw-remove-media tags are added to edits without changes in media - https://phabricator.wikimedia.org/T286362 [12:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:50] RECOVERY - etcd 
request latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [12:09:06] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:756014|Stop capturing media change tags (T286362)]] (2/2) (duration: 00m 50s) [12:09:21] taavi: thanks! [12:09:26] you're welcome [12:09:37] taavi@deploy1002: Failed to log message to wiki. Somebody should check the error logs. [12:09:42] umh [12:09:50] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:756014|Stop capturing media change tags (T286362)]] (2/2) (duration: 00m 50s) [12:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:01] nn1l2: have you checked if your changes will affect any gadget definitions or abuse filters that may be checking just for `patrolmarks`? [12:12:04] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:12:11] (03Abandoned) 10Cathal Mooney: Add new function to return device 'underlay' network links. 
[software/homer] - 10https://gerrit.wikimedia.org/r/759707 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [12:12:25] no, I have not [12:12:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:12:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:17] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): spicerack: introduce GridEngine controller - https://phabricator.wikimedia.org/T300032 (10aborrero) [12:14:40] (03CR) 10jerkins-bot: [V: 04-1] Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/760537 (owner: 10L10n-bot) [12:14:50] I don't see anything relevant in https://global-search.toolforge.org/?q=patrolmarks&namespaces=2%2C4%2C8&title=%28Gadgets-definition%7C.*%5C.%28js%7Ccss%7Cjson%29%29 or search-filters, so continuing [12:14:56] (03PS4) 10Majavah: Remove redundant patrolmarks flag from patroller usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759564 (https://phabricator.wikimedia.org/T300913) (owner: 104nn1l2) [12:15:16] (03CR) 10Majavah: [C: 03+2] Remove redundant patrolmarks flag from patroller usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759564 (https://phabricator.wikimedia.org/T300913) (owner: 104nn1l2) [12:16:27] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): spicerack: introduce GridEngine controller - https://phabricator.wikimedia.org/T300032 (10aborrero) 05Open→03Stalled >>! 
In T300032#7656713, @Volans wrote: > @aborrero thanks for opening this task! > > I had a chat with @jbond on... [12:16:55] (03CR) 10jerkins-bot: [V: 04-1] maps: Add kafka helper scripts [puppet] - 10https://gerrit.wikimedia.org/r/745297 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [12:17:03] (03Merged) 10jenkins-bot: Remove redundant patrolmarks flag from patroller usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759564 (https://phabricator.wikimedia.org/T300913) (owner: 104nn1l2) [12:17:04] !log btullis@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=aqs1009.eqiad.wmnet [12:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:21] nn1l2: please test on mwdebug1001 [12:17:38] (03CR) 10Jgiannelos: "Updated the WIP patch from the preparation of the tegola release. These are the 2 utils I have already used to interact with kafka for th" [puppet] - 10https://gerrit.wikimedia.org/r/745297 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [12:17:55] zabe: does it matter in which order your patch is synced? 
[12:18:02] no [12:18:17] ok great [12:18:44] LGTM [12:19:10] syncing [12:19:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:22] for zabe's patch, I guess everything except tests/ needs to be synced [12:19:42] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:759564|Remove redundant patrolmarks flag from patroller usergroup (T300913)]] (duration: 00m 48s) [12:19:44] yep [12:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:46] T300913: Remove redundant patrolmarks - https://phabricator.wikimedia.org/T300913 [12:20:03] (03PS4) 10Majavah: Migrate $wmfRealm calls to $wmgRealm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759300 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [12:20:18] (03PS1) 10MSantos: tegola: configure eqiad pregeneration schedule [deployment-charts] - 10https://gerrit.wikimedia.org/r/760543 [12:20:22] (03CR) 10Majavah: [C: 03+2] "deploying" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759300 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [12:20:26] (03PS4) 10Jgiannelos: maps: Add kafka helper scripts [puppet] - 10https://gerrit.wikimedia.org/r/745297 (https://phabricator.wikimedia.org/T289771) [12:21:12] (03Merged) 10jenkins-bot: Migrate $wmfRealm calls to $wmgRealm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759300 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [12:21:18] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:21:32] zabe: can you test that everything still seems to load on mwdebug1001? 
[12:21:41] yes, looking [12:22:14] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd1005.eqiad.wmnet with reason: Switch to plain disk storage [12:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd1005.eqiad.wmnet with reason: Switch to plain disk storage [12:22:18] my plan is to sync multiversion/ with one command, wmf-config/ with another and w/robots.php with a third [12:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:23:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:35] taavi: pages still seem to work, robots.php still seems to work and nothing in logstash, so I think we are good to go [12:24:41] sync plan sounds good to me [12:25:11] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host mc2043.mgmt.codfw.wmnet with reboot policy FORCED [12:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:41] !log taavi@deploy1002 Synchronized multiversion: Config: [[gerrit:759300|Migrate $wmfRealm calls to $wmgRealm (T45956)]] (1/3) (duration: 00m 48s) [12:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:44] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [12:25:53] (03PS2) 10MSantos: maps: remove tilerator logic from planet_sync [puppet] - 10https://gerrit.wikimedia.org/r/759894 [12:26:30] (03CR) 10jerkins-bot: [V: 04-1] maps: remove tilerator logic from planet_sync [puppet] - 
10https://gerrit.wikimedia.org/r/759894 (owner: 10MSantos) [12:26:37] !log taavi@deploy1002 Synchronized wmf-config: Config: [[gerrit:759300|Migrate $wmfRealm calls to $wmgRealm (T45956)]] (2/3) (duration: 00m 48s) [12:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:26] !log taavi@deploy1002 Synchronized w/robots.php: Config: [[gerrit:759300|Migrate $wmfRealm calls to $wmgRealm (T45956)]] (3/3) (duration: 00m 48s) [12:27:28] (03PS2) 10Majavah: Ensure GlobalBlocking is not loaded without CentralAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/760202 (https://phabricator.wikimedia.org/T299371) [12:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:54] zabe: fully synced out! logspam-watch doesn't show anything new, which is a good sign [12:28:07] (03CR) 10Majavah: [C: 03+2] Ensure GlobalBlocking is not loaded without CentralAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/760202 (https://phabricator.wikimedia.org/T299371) (owner: 10Majavah) [12:28:37] (03PS3) 10MSantos: maps: remove tilerator logic from planet_sync [puppet] - 10https://gerrit.wikimedia.org/r/759894 [12:28:51] (03Merged) 10jenkins-bot: Ensure GlobalBlocking is not loaded without CentralAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/760202 (https://phabricator.wikimedia.org/T299371) (owner: 10Majavah) [12:30:35] taavi: yep, thanks for your help :) [12:31:18] !log taavi@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:760202|Ensure GlobalBlocking is not loaded without CentralAuth (T299371)]] (1/2) (duration: 00m 48s) [12:31:20] !log revert kubestagetcd1005 to plain disk storage [12:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:22] 
T299371: Migrate globalblocks table to use central ids instead of usernames - https://phabricator.wikimedia.org/T299371 [12:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:08] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd1006.eqiad.wmnet with reason: Switch to plain disk storage [12:32:08] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:760202|Ensure GlobalBlocking is not loaded without CentralAuth (T299371)]] (2/2) (duration: 00m 48s) [12:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd1006.eqiad.wmnet with reason: Switch to plain disk storage [12:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:20] anyone have anything else to deploy? [12:32:39] (03CR) 10Jbond: Revert "elastic: install elasticsearch-oss from component" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758908 (owner: 10Ryan Kemper) [12:32:41] !log UTC morning deploys done [12:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:04] taavi, do you have time for a quick question? [12:33:07] yes [12:33:26] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:33:47] did you yourself check if my changes would affect any gadget definitions or abuse filters that may be checking just for `patrolmarks`? 
[12:34:02] If so, I just want to know how. [12:34:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:34:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:10] I want to learn. Thanks [12:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:14] !log revert kubestagetcd1006 to plain disk storage [12:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:11] nn1l2: yes. I used the global-search.toolforge.org to check for all css/js/json pages and gadgets-definition pages on all wikis, and search-filters.toolforge.org (not public, you need to have permissions to see private filters globally) to search for all abuse filters [12:35:22] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:35:32] (03PS1) 10Jbond: DO NOT MERGE: test CI [puppet] - 10https://gerrit.wikimedia.org/r/760545 [12:37:47] Thanks! Doesn't this page (https://meta.wikimedia.org/wiki/User:OldBee/LiveRCSiteConfig.js) need editing? [12:38:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:08] (03PS1) 10Cathal Mooney: Add new function to return device 'underlay' network links. [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/760566 (https://phabricator.wikimedia.org/T299758) [12:39:46] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff) [12:40:18] it's a script (config file?) 
that hasn't been touched in years doesn't look like something that will actually break if I don't touch it, so I don't think that's necessary [12:40:30] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc2043.mgmt.codfw.wmnet with reboot policy FORCED [12:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:39] s/years doesn't/years and doesn't [12:42:28] Thanks! [12:43:51] (03PS2) 10Zabe: Add ombuds.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/756732 (https://phabricator.wikimedia.org/T273323) [12:44:02] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/759751 (https://phabricator.wikimedia.org/T300853) (owner: 10Cwhite) [12:44:53] !log installing ruby2.7 security updates [12:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:52] (03CR) 10MSantos: [C: 03+1] maps: Add kafka helper scripts [puppet] - 10https://gerrit.wikimedia.org/r/745297 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [12:47:25] (03PS2) 10Cathal Mooney: New function and changes to wmf-netbox plugin to support EVPN config. [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/760566 (https://phabricator.wikimedia.org/T299758) [12:48:35] 10SRE, 10cloud-services-team, 10User-MoritzMuehlenhoff: Investigate use of hp-asrd on HPE servers - https://phabricator.wikimedia.org/T221939 (10Aklapper) [12:48:54] (03CR) 10Abijeet Patro: [V: 03+2] Localisation updates from https://translatewiki.net. 
[software/mailman-templates] - 10https://gerrit.wikimedia.org/r/760537 (owner: 10L10n-bot) [12:50:24] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:55:06] (03PS2) 10Ladsgroup: admin: Add bwang to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/759845 (https://phabricator.wikimedia.org/T300664) [12:55:16] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:57:18] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] admin: Add bwang to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/759845 (https://phabricator.wikimedia.org/T300664) (owner: 10Ladsgroup) [12:57:42] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:57:49] (03PS2) 10Ladsgroup: admin: Adding AUgolnikova [puppet] - 10https://gerrit.wikimedia.org/r/759846 (https://phabricator.wikimedia.org/T300878) [12:57:54] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] admin: Adding AUgolnikova [puppet] - 10https://gerrit.wikimedia.org/r/759846 (https://phabricator.wikimedia.org/T300878) (owner: 10Ladsgroup) [12:59:45] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Product-Analytics, and 2 others: Requesting access to Superset for AUgolnikova - https://phabricator.wikimedia.org/T300878 (10Ladsgroup) 05Open→03Resolved You should have access now (in thirty minutes), reopen if that's not the case [13:00:28] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Ladsgroup: Requesting access to MediaWiki deployment shell for bwang - https://phabricator.wikimedia.org/T300664 (10Ladsgroup) 05Open→03Resolved You should have access now (in thirty minutes), reopen if that's not the case [13:00:38] 10SRE, 10Observability-Logging, 10User-fgiunchedi: Ingest webrequest sampled 1000 into logstash - https://phabricator.wikimedia.org/T301110 (10fgiunchedi) [13:04:02] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [13:06:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1020.eqiad.wmnet [13:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:32] (03CR) 10Jgiannelos: [C: 03+1] tegola: configure eqiad pregeneration schedule [deployment-charts] - 10https://gerrit.wikimedia.org/r/760543 (owner: 10MSantos) [13:09:50] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:11:18] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [13:12:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1020.eqiad.wmnet [13:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1020.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [13:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:40] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1020.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [13:14:11] jmm@cumin2002: Failed to log message to wiki. Somebody should check the error logs. [13:14:40] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:14:44] !log update ferm on bullseye [13:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:06] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:20:01] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff) [13:31:40] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:32:21] (03PS1) 10Zabe: Start writing to $wmgConfigDir the same value as to $wmfConfigDir [mediawiki-config] - 10https://gerrit.wikimedia.org/r/760570 (https://phabricator.wikimedia.org/T45956) [13:34:06] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:41:52] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:52:37] (03CR) 10Kormat: wmfdb/cli_admin/db_compare: Add db-compare utility. 
(031 comment) [software/wmfdb] - 10https://gerrit.wikimedia.org/r/759504 (https://phabricator.wikimedia.org/T298236) (owner: 10Kormat) [14:03:45] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:08:17] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:11:57] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [14:13:33] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:14:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [14:14:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [14:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T298554)', diff saved to https://phabricator.wikimedia.org/P20193 and previous config saved to /var/cache/conftool/dbconfig/20220207-141452-ladsgroup.json [14:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:59] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - 
https://phabricator.wikimedia.org/T298554 [14:15:07] 10SRE, 10Maps: Allow Wikimedia Maps usage on bbcrewind.co.uk - https://phabricator.wikimedia.org/T297968 (10LWyatt) a:03JMinor [14:15:28] (03PS6) 10Jelto: gitlab_runner: execute gitlab-runner as non-root [puppet] - 10https://gerrit.wikimedia.org/r/759254 (https://phabricator.wikimedia.org/T295481) [14:19:31] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:20:03] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33591/console" [puppet] - 10https://gerrit.wikimedia.org/r/759254 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [14:20:23] (03CR) 10Jelto: [V: 03+1] gitlab_runner: execute gitlab-runner as non-root (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/759254 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [14:22:01] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [14:24:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T298554)', diff saved to https://phabricator.wikimedia.org/P20194 and previous config saved to /var/cache/conftool/dbconfig/20220207-142445-ladsgroup.json [14:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:50] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [14:30:49] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:35:25] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:37:43] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Set spark maxPartitionBytes to hadoop dfs block size [puppet] - 10https://gerrit.wikimedia.org/r/758529 (https://phabricator.wikimedia.org/T300299) (owner: 10Ottomata) [14:38:48] (03PS1) 10Arturo Borrero Gonzalez: cmd-checklist-runner: refresh code [puppet] - 10https://gerrit.wikimedia.org/r/760574 [14:39:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P20195 and previous config saved to /var/cache/conftool/dbconfig/20220207-143950-ladsgroup.json [14:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:37] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on 
English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:46:43] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:47:28] 10SRE, 10serviceops, 10Continuous-Integration-Config, 10Release-Engineering-Team (CI & Testing services), 10Test-Coverage: Add pcov PHP extension to wikimedia apt so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (10Daimona) 05Resolved→03Open Sorry for reopening this; woul... [14:51:17] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:53:15] (03PS6) 10Cathal Mooney: Base config additions and updated templates to configure EVPN ASW [homer/public] - 10https://gerrit.wikimedia.org/r/759709 (https://phabricator.wikimedia.org/T299758) [14:54:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P20196 and previous config saved to /var/cache/conftool/dbconfig/20220207-145454-ladsgroup.json [14:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:51] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:56:11] (03CR) 10Volans: "Replies inline, IMHO I think the potential sql-injection should still be addressed." 
[software/wmfdb] - 10https://gerrit.wikimedia.org/r/759504 (https://phabricator.wikimedia.org/T298236) (owner: 10Kormat) [15:00:33] 10SRE: mirrors.wikimedia.org debian repository fails to serve packages from time to time - https://phabricator.wikimedia.org/T300985 (10cmooney) @jhathaway interesting results. The duplicate ACKs and retransmissions I assume are due to packet loss, and the fact that this is happening from sretest1002 rules out... [15:00:59] (03CR) 10JMeybohm: [C: 03+1] install_server: add partman recipe kubernetes-node-overlay.cfg [puppet] - 10https://gerrit.wikimedia.org/r/759716 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [15:07:17] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:10:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T298554)', diff saved to https://phabricator.wikimedia.org/P20197 and previous config saved to /var/cache/conftool/dbconfig/20220207-150959-ladsgroup.json [15:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:05] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [15:10:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [15:10:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [15:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[15:10:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T298554)', diff saved to https://phabricator.wikimedia.org/P20198 and previous config saved to /var/cache/conftool/dbconfig/20220207-151018-ladsgroup.json [15:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:53] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:13:55] (03CR) 10Elukey: [C: 03+2] install_server: add partman recipe kubernetes-node-overlay.cfg [puppet] - 10https://gerrit.wikimedia.org/r/759716 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [15:17:07] (03PS1) 10Elukey: install_server: set new partman recipe for ml-serve2005 [puppet] - 10https://gerrit.wikimedia.org/r/760579 (https://phabricator.wikimedia.org/T300744) [15:19:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298554)', diff saved to https://phabricator.wikimedia.org/P20199 and previous config saved to /var/cache/conftool/dbconfig/20220207-151917-ladsgroup.json [15:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:27] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [15:26:49] (03CR) 10Elukey: [C: 03+2] install_server: set new partman recipe for ml-serve2005 [puppet] - 10https://gerrit.wikimedia.org/r/760579 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [15:28:50] (03PS2) 
10JMeybohm: Allow to configure a different port for ProxyFetch monitor [debs/pybal] - 10https://gerrit.wikimedia.org/r/759749 (https://phabricator.wikimedia.org/T301137) [15:30:54] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2005.codfw.wmnet with OS bullseye [15:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P20200 and previous config saved to /var/cache/conftool/dbconfig/20220207-153424-ladsgroup.json [15:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:57] !log uploaded scap 4.3-0 to apt.w.o - T300804 [15:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:05] T300804: Deploy Scap version 4.3.0 - https://phabricator.wikimedia.org/T300804 [15:38:13] yay! [15:40:30] !log updated scap to 4.3.0 on A:mw-canary, A:parsoid-canary, A:mw-jobrunner-canary, A:restbase-canary - T300804 [15:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:39] !log jayme@deploy1002 Started deploy [restbase/deploy@0848b15] (dev-cluster): (no justification provided) [15:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:43] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:41:55] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [15:43:50] (03PS1) 10Elukey: install_server: avoid swap in d-i when using kubernetes-node-overlay.cfg [puppet] - 
10https://gerrit.wikimedia.org/r/760582 (https://phabricator.wikimedia.org/T300744) [15:44:09] !log jayme@deploy1002 Finished deploy [restbase/deploy@0848b15] (dev-cluster): (no justification provided) (duration: 02m 30s) [15:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:27] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:47:49] !log installing pillow security updates [15:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:23] moritzm: that's just you taking a nap, right? :D [15:48:29] lol [15:48:52] :-) [15:49:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P20201 and previous config saved to /var/cache/conftool/dbconfig/20220207-154928-ladsgroup.json [15:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:25] (03CR) 10Elukey: [C: 03+2] install_server: avoid swap in d-i when using kubernetes-node-overlay.cfg [puppet] - 10https://gerrit.wikimedia.org/r/760582 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [15:59:04] 10SRE, 10SRE-Access-Requests: Requesting access to Superset/Turnilo for Kinneretgordon - https://phabricator.wikimedia.org/T301098 (10Maryana) Hi! @KinneretG's manager here – looks like she was blocked from Superset over the weekend for some reason. Could you please reinstate her permissions? Thanks! 
[16:02:40] 10SRE, 10Traffic-Icebox, 10SecTeam-Processed: Consider removing X-Wikimedia-Security-Audit VCL support - https://phabricator.wikimedia.org/T229320 (10sbassett) [16:03:33] 10SRE: mirrors.wikimedia.org debian repository fails to serve packages from time to time - https://phabricator.wikimedia.org/T300985 (10MoritzMuehlenhoff) Two things/tests here which came to my mind: 1. The reproducer pulls packages from security.debian.org and mirrors.debian.org (e.g. firefox-esr and chromium... [16:03:42] 10ops-eqiad, 10Traffic-Icebox: Migrate lvs101[345] to lvs101[789] - https://phabricator.wikimedia.org/T301142 (10BBlack) [16:04:14] 10ops-eqiad, 10Traffic-Icebox: Migrate lvs101[345] to lvs101[789] - https://phabricator.wikimedia.org/T301142 (10BBlack) [16:04:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298554)', diff saved to https://phabricator.wikimedia.org/P20203 and previous config saved to /var/cache/conftool/dbconfig/20220207-160433-ladsgroup.json [16:04:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [16:04:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [16:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:38] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [16:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T298554)', diff saved to https://phabricator.wikimedia.org/P20204 and previous config saved to /var/cache/conftool/dbconfig/20220207-160441-ladsgroup.json [16:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[16:04:57] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2005.codfw.wmnet with OS bullseye [16:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:03] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [16:05:57] !log migrating instances off ganeti1021 [16:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:29] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [16:13:15] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [16:13:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cmd-checklist-runner: refresh code [puppet] - 10https://gerrit.wikimedia.org/r/760574 (owner: 10Arturo Borrero Gonzalez) [16:14:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298554)', diff saved to https://phabricator.wikimedia.org/P20205 and previous config saved to /var/cache/conftool/dbconfig/20220207-161430-ladsgroup.json [16:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:35] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [16:17:58] (03PS6) 10AOkoth: kuberenetes: disable mwautopull timer [puppet] - 10https://gerrit.wikimedia.org/r/754960 (https://phabricator.wikimedia.org/T288345) [16:17:59] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:19:00] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/759757 (https://phabricator.wikimedia.org/T300853) (owner: 10Cwhite) [16:19:35] 10SRE-tools, 10Infrastructure-Foundations, 10serviceops: Add a kubernetes module to spicerack - https://phabricator.wikimedia.org/T300879 (10Joe) [16:22:21] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2029.codfw.wmnet with OS buster [16:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:23] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd2001.codfw.wmnet with reason: Switch to plain disk storage [16:22:26] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q3:(Need By: TBD) rack/setup/install ganeti2029.codfw.wmnet, ganeti2030.codfw.wmnet - https://phabricator.wikimedia.org/T298998 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ganeti2029.codfw.w... 
[16:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd2001.codfw.wmnet with reason: Switch to plain disk storage [16:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:27] 10SRE, 10envoy, 10serviceops: The TLS proxy configuration in deployment-charts allows invalid listeners - https://phabricator.wikimedia.org/T291959 (10Joe) 05Open→03Resolved [16:24:45] !log switch kubestagetcd2001 to plain disk storage [16:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:30] 10SRE, 10Observability-Logging, 10User-fgiunchedi: Ingest webrequest sampled 1000 into logstash - https://phabricator.wikimedia.org/T301110 (10colewhite) Related: https://gerrit.wikimedia.org/r/c/operations/puppet/+/702163 [16:29:14] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd2002.codfw.wmnet with reason: Switch to plain disk storage [16:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd2002.codfw.wmnet with reason: Switch to plain disk storage [16:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:30] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33592/console" [puppet] - 10https://gerrit.wikimedia.org/r/745297 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [16:29:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P20206 and previous config saved to /var/cache/conftool/dbconfig/20220207-162935-ladsgroup.json [16:29:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:00] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] maps: Add kafka helper scripts [puppet] - 10https://gerrit.wikimedia.org/r/745297 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [16:30:05] jan_drewniak: May I have your attention please! Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220207T1630) [16:30:17] !log switch kubestagetcd2002 to plain disk storage [16:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:47] 10SRE, 10Observability-Metrics, 10Sustainability (Incident Followup): prometheus: usable dashboard for meta-metrics about Prometheus itself (query durations etc) - https://phabricator.wikimedia.org/T222102 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi The dashboard at https://grafana.wikimedia.org/d/... [16:33:12] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:35:53] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10Papaul) [16:36:52] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10Papaul) 05Open→03Resolved complete [16:36:55] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (10Papaul) [16:37:46] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:37:48] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team, 
10Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (10Papaul) 05Open→03Resolved complete [16:38:34] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10mpopov) Hello! I would prefer to not have an allowlist for external domains, but if the final decision is to have one then... [16:38:59] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd2003.codfw.wmnet with reason: Switch to plain disk storage [16:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd2003.codfw.wmnet with reason: Switch to plain disk storage [16:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:39] (03PS1) 10Arturo Borrero Gonzalez: wmcs: CmdCheckList: refresh parser [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/760608 [16:41:26] !log switch kubestagetcd2003 to plain disk storage [16:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P20207 and previous config saved to /var/cache/conftool/dbconfig/20220207-164439-ladsgroup.json [16:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2029.codfw.wmnet with OS buster [16:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:59] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q3:(Need By: TBD) rack/setup/install ganeti2029.codfw.wmnet, ganeti2030.codfw.wmnet - 
https://phabricator.wikimedia.org/T298998 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ganeti2029.codfw.wmnet... [16:53:09] (03PS2) 10Jbond: rake_tests: enable single_quote_string_with_variables [puppet] - 10https://gerrit.wikimedia.org/r/760545 [16:53:39] (03PS3) 10Jbond: rake_tests: enable single_quote_string_with_variables [puppet] - 10https://gerrit.wikimedia.org/r/760545 (https://phabricator.wikimedia.org/T300928) [16:54:07] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: CmdCheckList: refresh parser [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/760608 (owner: 10Arturo Borrero Gonzalez) [16:54:40] (03CR) 10jerkins-bot: [V: 04-1] rake_tests: enable single_quote_string_with_variables [puppet] - 10https://gerrit.wikimedia.org/r/760545 (https://phabricator.wikimedia.org/T300928) (owner: 10Jbond) [16:54:54] (03PS1) 10BBlack: Add netflow6001 to kafka custom ferm [puppet] - 10https://gerrit.wikimedia.org/r/760613 (https://phabricator.wikimedia.org/T282787) [16:54:56] (03PS1) 10BBlack: Add ops-drmrs to alertmanager config [puppet] - 10https://gerrit.wikimedia.org/r/760614 (https://phabricator.wikimedia.org/T282787) [16:54:58] (03PS1) 10BBlack: drmrs: add vk delivery error alerting [puppet] - 10https://gerrit.wikimedia.org/r/760615 (https://phabricator.wikimedia.org/T282787) [16:55:00] (03PS1) 10BBlack: Add drmrs to smokeping config [puppet] - 10https://gerrit.wikimedia.org/r/760616 (https://phabricator.wikimedia.org/T282788) [16:55:02] (03PS1) 10BBlack: smokeping: monitor eqsin switch [puppet] - 10https://gerrit.wikimedia.org/r/760617 (https://phabricator.wikimedia.org/T186650) [16:55:09] (03PS1) 10Andrew Bogott: cloud-vps codfw1dev: switch to new puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/760618 [16:55:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2030.codfw.wmnet with OS buster [16:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[16:55:15] (03PS1) 10Hnowlan: maps: remove tilerator and cassandra [puppet] - 10https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) [16:55:43] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q3:(Need By: TBD) rack/setup/install ganeti2029.codfw.wmnet, ganeti2030.codfw.wmnet - https://phabricator.wikimedia.org/T298998 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ganeti2030.codfw.w... [16:56:33] (03PS2) 10Hnowlan: maps: remove tilerator and cassandra [puppet] - 10https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) [16:58:16] (03PS4) 10Jbond: rake_tests: enable single_quote_string_with_variables [puppet] - 10https://gerrit.wikimedia.org/r/760545 (https://phabricator.wikimedia.org/T300928) [16:58:45] 10SRE, 10ops-codfw, 10Discovery-Search, 10decommission-hardware: decommission elastic2035.codfw.wmnet - https://phabricator.wikimedia.org/T300946 (10Papaul) [16:58:47] 10SRE, 10ops-eqiad, 10Traffic-Icebox: Migrate lvs101[345] to lvs101[789] - https://phabricator.wikimedia.org/T301142 (10wiki_willy) a:03Cmjohnson [16:59:18] 10ops-eqiad, 10DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T300820 (10BBlack) @cmjohnson - yes, we need to depool lvs1015 before working on this. I just looked at the graph again, and looks like the errors peaked and vanished late last week? https://librenms.wikimedia.org/graphs/to=16... 
[16:59:24] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:59:26] 10SRE, 10ops-codfw, 10Discovery-Search, 10decommission-hardware: decommission elastic2035.codfw.wmnet - https://phabricator.wikimedia.org/T300946 (10Papaul) 05Open→03Resolved complete [16:59:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298554)', diff saved to https://phabricator.wikimedia.org/P20208 and previous config saved to /var/cache/conftool/dbconfig/20220207-165944-ladsgroup.json [16:59:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [16:59:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [16:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:49] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [16:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T298554)', diff saved to https://phabricator.wikimedia.org/P20209 and previous config saved to /var/cache/conftool/dbconfig/20220207-165952-ladsgroup.json [16:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:16] 10SRE, 10ops-codfw, 10DBA: x1 codfw master crashed due to faulty DIMM - https://phabricator.wikimedia.org/T300965 (10Papaul) a:03Papaul [17:02:56] (03CR) 10Jbond: [C: 03+2] rake_tests: enable single_quote_string_with_variables [puppet] - 10https://gerrit.wikimedia.org/r/760545 (https://phabricator.wikimedia.org/T300928) (owner: 10Jbond) [17:03:18] (03CR) 
10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33595/console" [puppet] - 10https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [17:11:46] (03CR) 10Elukey: [C: 03+1] Enable nodePort 30021 for ingressgateway status [deployment-charts] - 10https://gerrit.wikimedia.org/r/759726 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [17:12:32] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps codfw1dev: switch to new puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/760618 (owner: 10Andrew Bogott) [17:16:50] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [17:17:52] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:19:33] (03CR) 10JMeybohm: [C: 03+1] "I don't like the name very much. Maybe something like: profile::docker::engine::force_default_docker_storage would be more speaking?" 
[puppet] - 10https://gerrit.wikimedia.org/r/759678 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [17:23:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298554)', diff saved to https://phabricator.wikimedia.org/P20210 and previous config saved to /var/cache/conftool/dbconfig/20220207-172343-ladsgroup.json [17:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:48] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [17:26:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2030.codfw.wmnet with OS buster [17:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:17] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q3:(Need By: TBD) rack/setup/install ganeti2029.codfw.wmnet, ganeti2030.codfw.wmnet - https://phabricator.wikimedia.org/T298998 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ganeti2030.codfw.wmnet... 
[17:26:48] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host mc2042.mgmt.codfw.wmnet with reboot policy FORCED [17:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:05] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q3:(Need By: TBD) rack/setup/install ganeti2029.codfw.wmnet, ganeti2030.codfw.wmnet - https://phabricator.wikimedia.org/T298998 (10Papaul) [17:28:58] (03CR) 10Elukey: profile::docker::engine: add param to ignore docker storage settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/759678 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [17:29:20] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q3:(Need By: TBD) rack/setup/install ganeti2029.codfw.wmnet, ganeti2030.codfw.wmnet - https://phabricator.wikimedia.org/T298998 (10Papaul) 05Open→03Resolved @MoritzMuehlenhoff this is complete [17:31:47] (03PS4) 10Elukey: profile::docker::engine: add param to ignore docker storage settings [puppet] - 10https://gerrit.wikimedia.org/r/759678 (https://phabricator.wikimedia.org/T300744) [17:32:00] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [17:33:21] papaul: hey. you're looking for someone re: db2096? 
[17:33:44] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33599/console" [puppet] - 10https://gerrit.wikimedia.org/r/759678 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [17:36:29] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33600/console" [puppet] - 10https://gerrit.wikimedia.org/r/759678 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [17:37:28] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [17:38:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P20211 and previous config saved to /var/cache/conftool/dbconfig/20220207-173848-ladsgroup.json [17:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:08] (03PS8) 10JMeybohm: Add LVS service k8s-ingress-staging [puppet] - 10https://gerrit.wikimedia.org/r/759260 (https://phabricator.wikimedia.org/T300740) [17:42:22] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc2042.mgmt.codfw.wmnet with reboot policy FORCED [17:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:23] 10SRE, 10DC-Ops, 10cloud-services-team (Kanban): Supporting new hardware in older debian releases - https://phabricator.wikimedia.org/T301162 (10nskaggs) [17:45:26] (03CR) 10SBassett: [C: 03+1] "The security-team is fine with these as Wikimedia origin sources." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/759739 (https://phabricator.wikimedia.org/T300978) (owner: 10Stang) [17:48:22] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [17:51:36] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2005.codfw.wmnet with OS buster [17:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:45] (03CR) 10Ahmon Dancy: [C: 03+1] "This is good to go" [puppet] - 10https://gerrit.wikimedia.org/r/758962 (owner: 10Ahmon Dancy) [17:53:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P20212 and previous config saved to /var/cache/conftool/dbconfig/20220207-175352-ladsgroup.json [17:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:32] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [17:56:10] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2019.wmnet [17:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:15] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2020.wmnet [17:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:20] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:36] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [17:58:09] (03CR) 10JMeybohm: [C: 03+1] profile::docker::engine: add param to ignore docker storage settings [puppet] - 10https://gerrit.wikimedia.org/r/759678 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [17:58:16] PROBLEM - Host restbase2019 is DOWN: PING CRITICAL - Packet loss = 100% [17:58:54] PROBLEM - Host restbase2020 is DOWN: PING CRITICAL - Packet loss = 100% [18:00:05] ryankemper: Your horoscope predicts another unfortunate Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220207T1800). [18:01:47] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:15] restbase hosts are me, oops [18:02:34] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on restbase2019.codfw.wmnet with reason: Firmware upgrade [18:02:35] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on restbase2019.codfw.wmnet with reason: Firmware upgrade [18:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:40] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on restbase2020.codfw.wmnet with reason: Firmware upgrade [18:02:42] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on restbase2020.codfw.wmnet with reason: Firmware upgrade [18:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:08] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - 
https://phabricator.wikimedia.org/T294962 (10Papaul) [18:07:28] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [18:07:46] PROBLEM - MariaDB Replica IO: x1 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2096.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:07:54] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:08:20] PROBLEM - MariaDB Replica IO: x1 on db2115 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2096.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:08:20] PROBLEM - MariaDB Replica IO: x1 on db2131 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2096.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:08:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298554)', diff saved to https://phabricator.wikimedia.org/P20213 and previous config saved to /var/cache/conftool/dbconfig/20220207-180857-ladsgroup.json 
[18:08:58] ^ Amir1 that you? [18:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:02] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [18:09:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [18:09:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [18:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance [18:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance [18:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:38] jynus: no, that's me, damnit. [18:09:46] ok, no worries [18:10:01] jynus: sorry I just saw it [18:10:21] 10SRE, 10ops-codfw, 10DBA: x1 codfw master crashed due to faulty DIMM - https://phabricator.wikimedia.org/T300965 (10Papaul) @Kormat since the server is down for maintenance can i take advantage of this downtime and move the server to a 1G rack since it is in a 10G rack? 
From B4 to B6 [18:10:31] ACKNOWLEDGEMENT - MariaDB Replica IO: x1 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2096.codfw.wmnet (111 Connection refused) Kormat db2096 down for hw maintenance https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:10:31] ACKNOWLEDGEMENT - MariaDB Replica IO: x1 on db2115 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2096.codfw.wmnet (111 Connection refused) Kormat db2096 down for hw maintenance https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:10:31] ACKNOWLEDGEMENT - MariaDB Replica IO: x1 on db2131 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2096.codfw.wmnet (111 Connection refused) Kormat db2096 down for hw maintenance https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:10:48] PROBLEM - Host db2096.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:12:31] 10SRE, 10ops-codfw, 10DBA: x1 codfw master crashed due to faulty DIMM - https://phabricator.wikimedia.org/T300965 (10Kormat) >>! In T300965#7690183, @Papaul wrote: > @Kormat since the server is down for maintenance can i take advantage of this downtime and move the server to a 1G rack since it is in a 10G ra... 
[18:15:38] RECOVERY - Host restbase2019 is UP: PING OK - Packet loss = 0%, RTA = 31.56 ms [18:17:44] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In [18:17:44] content returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [18:17:48] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:17:54] RECOVERY - Host restbase2020 is UP: PING OK - Packet loss = 0%, RTA = 33.15 ms [18:19:02] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [18:20:54] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2005.codfw.wmnet with OS buster [18:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:50] RECOVERY - Host db2096.mgmt is UP: PING OK - Packet loss = 0%, RTA = 73.99 ms [18:22:12] 10SRE, 10ops-codfw, 10DC-Ops: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10Cmjohnson) [18:22:27] (03CR) 104nn1l2: [C: 04-1] Change / add some namespaces and aliases on arywiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747973 (https://phabricator.wikimedia.org/T291737) (owner: 10Ideophagous) [18:22:51] 10SRE, 10ops-codfw, 10DC-Ops: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update 
needed - https://phabricator.wikimedia.org/T299652 (10Cmjohnson) a:05Papaul→03Cmjohnson updated 2019 and 2020, resolving [18:25:56] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [18:26:47] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission pay-lvs1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T300168 (10Cmjohnson) [18:26:52] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission pay-lvs1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T300168 (10Cmjohnson) 05Open→03Resolved [18:27:27] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission pay-lvs1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T300165 (10Cmjohnson) [18:27:29] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission pay-lvs1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T300165 (10Cmjohnson) 05Open→03Resolved [18:28:24] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:29:56] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [18:30:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [18:30:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [18:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T298554)', diff saved to https://phabricator.wikimedia.org/P20214 and previous config saved to /var/cache/conftool/dbconfig/20220207-183059-ladsgroup.json [18:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:03] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [18:31:27] 10SRE, 10ops-codfw, 10DBA: x1 codfw master crashed due to faulty DIMM - https://phabricator.wikimedia.org/T300965 (10Papaul) 05Open→03Resolved @Kormat complete [18:32:20] 10SRE, 10ops-codfw: Possible cable issue on restbase2010 management interface - https://phabricator.wikimedia.org/T299426 (10Papaul) 05Open→03Resolved a:03Papaul IDRAC upgrade complete [18:32:28] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={LIST,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:33:39] (03PS1) 10Andrew Bogott: openstack puppetmaster: add missing ) in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/760631 [18:34:05] 10SRE, 10ops-codfw, 10DBA: x1 codfw master crashed due to faulty DIMM - https://phabricator.wikimedia.org/T300965 (10Kormat) Running `mysqlcheck --all-databases` now. 
[18:34:14] RECOVERY - MariaDB Replica IO: x1 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:34:36] RECOVERY - MariaDB Replica IO: x1 on db2115 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:34:43] (03CR) 10Majavah: [C: 03+1] "this is clearly my fault! sorry about that" [puppet] - 10https://gerrit.wikimedia.org/r/760631 (owner: 10Andrew Bogott) [18:41:26] RECOVERY - MariaDB Replica IO: x1 on db2131 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:42:23] (03PS1) 10Andrew Bogott: labspuppetbackend: fix a race condition with logfile ownership [puppet] - 10https://gerrit.wikimedia.org/r/760632 [18:42:29] (03CR) 10Andrew Bogott: [C: 03+2] openstack puppetmaster: add missing ) in ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/760631 (owner: 10Andrew Bogott) [18:43:28] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:43:40] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10Papaul) [18:48:00] PROBLEM - MariaDB Replica Lag: x1 on db2101 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2618.64 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:50:36] (03PS4) 10Clare Ming: Turn on wgVectorLanguageAlertInSidebar for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758890 (https://phabricator.wikimedia.org/T300559) (owner: 10Bernard Wang) [18:54:38] (03CR) 10Dzahn: "I'd be happy to merge this IF it is actually possible to rename wikis. 
Is it going to realistically happen though and this is blocking it?" [dns] - 10https://gerrit.wikimedia.org/r/756732 (https://phabricator.wikimedia.org/T273323) (owner: 10Zabe) [18:54:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298554)', diff saved to https://phabricator.wikimedia.org/P20215 and previous config saved to /var/cache/conftool/dbconfig/20220207-185459-ladsgroup.json [18:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:05] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [18:56:54] (03CR) 10Dzahn: [C: 03+2] "I was just asking to see what "is the worst that could happen" as I don't claim I would really be a good Perl reviewer. But based on the l" [puppet] - 10https://gerrit.wikimedia.org/r/758962 (owner: 10Ahmon Dancy) [18:57:34] (03CR) 10AGueyte: Update Event Stream for IPInfo events (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) (owner: 10AGueyte) [18:58:40] (03PS8) 10AGueyte: Update Event Stream for IPInfo events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) [18:59:20] (03CR) 10Zabe: Add ombuds.wikimedia.org (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/756732 (https://phabricator.wikimedia.org/T273323) (owner: 10Zabe) [19:00:05] RoanKattouw and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220207T1900). [19:00:05] cjming, cirno, and dduvall: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:13] o/ [19:00:22] o/ [19:00:31] o/ [19:00:32] (03CR) 10Dzahn: [C: 03+2] "ok, gotcha. 
thanks, i'll add it" [dns] - 10https://gerrit.wikimedia.org/r/756732 (https://phabricator.wikimedia.org/T273323) (owner: 10Zabe) [19:00:34] o/ [19:00:36] dduvall: do you want to self-service or do you want someone else to deploy your patch? [19:01:06] i can do it. thank you [19:01:09] i'll wait until the end [19:01:10] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [19:01:12] ok, sure [19:01:16] it's a beta-only config patch [19:01:36] ah, in that case I'll just merge+rebase to get it out of the way at the start [19:01:45] (03PS2) 10Majavah: beta: Discover etcd servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759543 (https://phabricator.wikimedia.org/T296771) (owner: 10Dduvall) [19:01:49] (03CR) 10Majavah: [C: 03+2] beta: Discover etcd servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759543 (https://phabricator.wikimedia.org/T296771) (owner: 10Dduvall) [19:02:02] (03PS5) 10Majavah: Turn on wgVectorLanguageAlertInSidebar for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758890 (https://phabricator.wikimedia.org/T300559) (owner: 10Bernard Wang) [19:02:06] (03CR) 10Majavah: [C: 03+2] Turn on wgVectorLanguageAlertInSidebar for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758890 (https://phabricator.wikimedia.org/T300559) (owner: 10Bernard Wang) [19:02:27] (03Merged) 10jenkins-bot: beta: Discover etcd servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759543 (https://phabricator.wikimedia.org/T296771) (owner: 10Dduvall) [19:02:30] (03CR) 10Dzahn: "deployed. 
host ombuds.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/756732 (https://phabricator.wikimedia.org/T273323) (owner: 10Zabe) [19:02:59] taavi: sounds good, too. thank you! [19:03:07] (03PS1) 10Jbond: DO NOT MERGE: example ci to demonstrate possible securecommand [software/cumin] - 10https://gerrit.wikimedia.org/r/760635 [19:03:11] (03Merged) 10jenkins-bot: Turn on wgVectorLanguageAlertInSidebar for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758890 (https://phabricator.wikimedia.org/T300559) (owner: 10Bernard Wang) [19:03:43] dduvall: your patch should get auto-deployed to beta in the next 30 mins or so [19:03:52] awesome. ty! [19:03:54] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [19:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:03] cjming: can you test on mwdebug1001 please? [19:04:10] yup [19:04:26] taavi: looks good [19:04:29] cool, syncing [19:04:33] ty! [19:05:09] .. just realized you now have deployment access too, hopefully you didn't want to try deploying yourself? (sorry for not realizing earlier!) [19:05:22] no worries! [19:05:30] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:758890|Turn on wgVectorLanguageAlertInSidebar for all wikis (T300559)]] (duration: 00m 49s) [19:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:34] T300559: Enable VectorLanguageAlertInSidebar on all wikis - https://phabricator.wikimedia.org/T300559 [19:05:49] cirno: hi! I think this is your first time deploying a config patch, correct? [19:05:55] yep [19:06:06] do you have the x-wikimedia-debug browser extension installed? 
[19:06:08] * urbanecm waves too now, happy to help if anything's needed [19:06:15] yes, installed [19:06:22] great [19:06:32] (03CR) 10Jbond: DO NOT MERGE: example ci to demonstrate possible securecommand (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/760635 (owner: 10Jbond) [19:06:34] (03PS3) 10Majavah: wgCrossSiteAJAXdomains: Add foundationwiki and {ee,ge,punjabi}wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759739 (https://phabricator.wikimedia.org/T300978) (owner: 10Stang) [19:06:50] (03CR) 10Majavah: [C: 03+2] "deploying with secteam approval on patch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759739 (https://phabricator.wikimedia.org/T300978) (owner: 10Stang) [19:07:31] (03Merged) 10jenkins-bot: wgCrossSiteAJAXdomains: Add foundationwiki and {ee,ge,punjabi}wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759739 (https://phabricator.wikimedia.org/T300978) (owner: 10Stang) [19:07:37] it'll take a minute or two for Jenkins to merge your patch and me to pull the patch to mwdebug1001, after that's done I'll ping you here and you can test it using the extension before syncing out to the whole cluster [19:07:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:02] got it, thanks [19:08:06] RECOVERY - Device not healthy -SMART- on restbase2010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=restbase2010&var-datasource=codfw+prometheus/ops [19:08:29] (03CR) 10jerkins-bot: [V: 04-1] DO NOT MERGE: example ci to demonstrate possible securecommand [software/cumin] - 10https://gerrit.wikimedia.org/r/760635 (owner: 10Jbond) [19:08:35] cirno: ok, your patch is now available for testing on mwdebug1001.eqiad.wmnet. 
can you test it using the extension (click the toggle and choose mwdebug1001 if not selected already) and report back here if it works as expected? [19:08:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P20216 and previous config saved to /var/cache/conftool/dbconfig/20220207-191003-ladsgroup.json [19:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:20] 10SRE, 10Maps: Allow Wikimedia Maps usage on bbcrewind.co.uk - https://phabricator.wikimedia.org/T297968 (10JMinor) While these are interesting speculations about the TOU, IANAL, and from a policy perspective, this is exactly aligned with the intentions of the policy change. This fills a gap in knowledge equ... [19:10:30] taavi: should I choose "XHGui" or something else in the extension? 
[19:10:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:10:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:54] cirno: no, you can leave everything else as default other than what I just said [19:11:28] the rest are various tools for performance testing and similar, we're currently only interested in routing the traffic to a mwdebug server [19:11:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:13] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 (10ssingh) [19:13:59] urbanecm: (since you were around) I see some "Uncaught RuntimeException: The UdpSocket to 127.0.0.1:10514 has been closed and can not be written to anymore" messages on the mwdebug logstash dashboard, but those seem unrelated? I'm going to ignore them unless someone says otherwise [19:14:33] taavi: that's normal and happens on mwdebug servers for quite some time [19:14:39] (it also has a task) [19:15:29] cirno: hi, how is testing going? 
[19:16:14] well, it work partially: on CORS side it do work, but I found a new issue [19:16:32] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={LIST,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:16:37] maybe I need another patch to solve that, so what should I do [19:16:41] (03PS2) 10Zabe: Add ombuds.wikimedia.org to mediawiki.yaml [puppet] - 10https://gerrit.wikimedia.org/r/756733 (https://phabricator.wikimedia.org/T273323) [19:16:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:56] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10Papaul) [19:18:07] up to you, I can sync it if it doesn't actively break anything, or I can revert the commit [19:18:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:18:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mc2044.mgmt.codfw.wmnet with reboot policy FORCED [19:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:56] taavi: I tested again, it seems problem still exist [19:21:33] should I revert then? 
[19:22:26] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [19:22:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:52] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:23:48] ...not sure, but I suggest yes [19:24:07] left a comment on the task [19:24:22] ack [19:24:29] (03PS1) 10Zabe: Stop writing to $wmfRealm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/760640 (https://phabricator.wikimedia.org/T45956) [19:24:31] (03PS1) 10Majavah: Revert "wgCrossSiteAJAXdomains: Add foundationwiki and {ee,ge,punjabi}wikimedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/760552 [19:24:51] (03CR) 10Majavah: [C: 03+2] Revert "wgCrossSiteAJAXdomains: Add foundationwiki and {ee,ge,punjabi}wikimedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/760552 (owner: 10Majavah) [19:25:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P20217 and previous config saved to /var/cache/conftool/dbconfig/20220207-192508-ladsgroup.json [19:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:45] (03Merged) 10jenkins-bot: Revert "wgCrossSiteAJAXdomains: Add foundationwiki and {ee,ge,punjabi}wikimedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/760552 (owner: 10Majavah) [19:26:39] what you commented on task seems indeed like a separate issue, but I'm not sure what to do about it [19:26:48] hopefully someone can help on task :( 
[20:09:56] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:11:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298554)', diff saved to https://phabricator.wikimedia.org/P20220 and previous config saved to /var/cache/conftool/dbconfig/20220207-201106-ladsgroup.json [20:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:11] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [20:11:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc2045.mgmt.codfw.wmnet with reboot policy FORCED [20:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:07] 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ganeti10[29-32] - https://phabricator.wikimedia.org/T301175 (10RobH) [20:12:31] 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ganeti10[29-32] - https://phabricator.wikimedia.org/T301175 (10RobH) [20:13:50] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [20:13:55] (03PS7) 10AOkoth: kuberenetes: disable mwautopull timer [puppet] - 10https://gerrit.wikimedia.org/r/754960 (https://phabricator.wikimedia.org/T288345) [20:13:57] (03PS1) 10AOkoth: admin: add jnuche to ldap users [puppet] - 10https://gerrit.wikimedia.org/r/760650 (https://phabricator.wikimedia.org/T301149) [20:14:20] (03PS2) 10AOkoth: admin: add jnuche to ldap users [puppet] - 10https://gerrit.wikimedia.org/r/760650 (https://phabricator.wikimedia.org/T301149) [20:14:53] 
(03CR) 10RLazarus: [C: 03+1] admin: add jnuche to ldap users [puppet] - 10https://gerrit.wikimedia.org/r/760650 (https://phabricator.wikimedia.org/T301149) (owner: 10AOkoth) [20:15:45] (03CR) 10AOkoth: [C: 03+2] admin: add jnuche to ldap users [puppet] - 10https://gerrit.wikimedia.org/r/760650 (https://phabricator.wikimedia.org/T301149) (owner: 10AOkoth) [20:16:16] 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ganeti10[29-32] - https://phabricator.wikimedia.org/T301175 (10RobH) 05Open→03Invalid I made this already and duplicated by accident. [20:17:00] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:18:56] 10ops-eqiad, 10DC-Ops, 10serviceops, 10GitLab (Infrastructure): (Need By: TBD) rack/setup/install gitlab100[2|3] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (10RobH) [20:18:57] !log mforns@deploy1002 Started deploy [airflow-dags/analytics-test@ef5783e]: (no justification provided) [20:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:04] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics-test@ef5783e]: (no justification provided) (duration: 00m 07s) [20:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:18] 10ops-eqiad, 10DC-Ops, 10serviceops, 10GitLab (Infrastructure): (Need By: TBD) rack/setup/install gitlab100[2|3] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (10RobH) [20:19:24] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:23:05] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mirror1001.wikimedia.org with reason: old kernel [20:23:07] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mirror1001.wikimedia.org with reason: old kernel [20:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:09] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Radar): Grant Access to wmf, releng, ciadmin for jnuche - https://phabricator.wikimedia.org/T301149 (10Arnoldokoth) Hi, you have been added to the requested groups. Welcome to the foundation! ` aokoth@mwmaint1002:~$ ldapsearch... [20:26:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P20221 and previous config saved to /var/cache/conftool/dbconfig/20220207-202611-ladsgroup.json [20:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:55] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Radar): Grant Access to wmf, releng, ciadmin for jnuche - https://phabricator.wikimedia.org/T301149 (10Arnoldokoth) 05Open→03Resolved a:03Arnoldokoth [20:30:22] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [20:30:51] !log mforns@deploy1002 Started deploy [airflow-dags/analytics-test@9afb96d]: (no justification 
provided) [20:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:59] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics-test@9afb96d]: (no justification provided) (duration: 00m 08s) [20:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:12] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={LIST,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:31:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance [20:31:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance [20:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T300510)', diff saved to https://phabricator.wikimedia.org/P20222 and previous config saved to /var/cache/conftool/dbconfig/20220207-203120-ladsgroup.json [20:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:25] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [20:33:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1129.eqiad.wmnet with OS bullseye [20:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:34] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:34:50] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mc2046.mgmt.codfw.wmnet with reboot policy FORCED [20:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:34] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [20:41:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P20223 and previous config saved to /var/cache/conftool/dbconfig/20220207-204115-ladsgroup.json [20:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:22] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:46:48] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:50:15] 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure): (Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10RobH) [20:50:39] 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure): (Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10RobH) [20:51:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc2046.mgmt.codfw.wmnet with reboot policy FORCED [20:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:46] (03CR) 10Dzahn: [C: 03+2] planet: remove Planet Apache links [puppet] - 10https://gerrit.wikimedia.org/r/759922 (owner: 10Majavah) [20:51:58] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:53:10] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [20:53:31] (03CR) 10Dzahn: "thank you, a little bit sad that they removed it" [puppet] - 10https://gerrit.wikimedia.org/r/759922 (owner: 10Majavah) [20:56:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298554)', diff saved to https://phabricator.wikimedia.org/P20225 and previous config saved to /var/cache/conftool/dbconfig/20220207-205620-ladsgroup.json [20:56:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [20:56:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [20:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:26] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [20:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:30] 10SRE, 10Infrastructure-Foundations: check_user - authorization error - https://phabricator.wikimedia.org/T300193 (10jbond) @Volans hi riccardo, before i loop in IT services: you mentioned that there may be a way for us to get our own service account on this, which would be easier to maintain going forward [21:00:04] chrisalbon and accraze: I, the Bot under the Fountain, call upon thee, The Deployer, to do Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220207T2100). 
[21:03:27] (03PS1) 10Ottomata: airflow - Set SKEIN_CONFIG [puppet] - 10https://gerrit.wikimedia.org/r/760658 (https://phabricator.wikimedia.org/T296527) [21:04:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1129.eqiad.wmnet with OS bullseye [21:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:42] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mc2047.mgmt.codfw.wmnet with reboot policy FORCED [21:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:36] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={LIST,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [21:06:50] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33607/console" [puppet] - 10https://gerrit.wikimedia.org/r/760658 (https://phabricator.wikimedia.org/T296527) (owner: 10Ottomata) [21:07:26] (03CR) 10Ottomata: [V: 03+1 C: 03+2] airflow - Set SKEIN_CONFIG [puppet] - 10https://gerrit.wikimedia.org/r/760658 (https://phabricator.wikimedia.org/T296527) (owner: 10Ottomata) [21:09:34] !log otto@deploy1002 Started deploy [airflow-dags/analytics-test@6d936db]: (no justification provided) [21:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:42] !log otto@deploy1002 Finished deploy [airflow-dags/analytics-test@6d936db]: (no justification provided) (duration: 00m 08s) [21:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [21:17:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance 
[21:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc2047.mgmt.codfw.wmnet with reboot policy FORCED [21:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:51] 10SRE, 10Infrastructure-Foundations: check_user - authorization error - https://phabricator.wikimedia.org/T300193 (10Volans) @jbond ping me on IRC when you have time and we can look at it together. If we can get what permissions are needed I think it should be possible to do it. [21:24:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mc2048.mgmt.codfw.wmnet with reboot policy FORCED [21:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:00] rzl: can you check? ^ [21:27:23] perfect, thank you! [21:28:03] any time [21:28:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T300510)', diff saved to https://phabricator.wikimedia.org/P20227 and previous config saved to /var/cache/conftool/dbconfig/20220207-212830-ladsgroup.json [21:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:36] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [21:30:52] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [21:36:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [21:36:44] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [21:36:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [21:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T298554)', diff saved to https://phabricator.wikimedia.org/P20228 and previous config saved to /var/cache/conftool/dbconfig/20220207-213650-ladsgroup.json [21:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:54] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [21:38:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc2048.mgmt.codfw.wmnet with reboot policy FORCED [21:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:04] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={LIST,POST} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [21:43:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P20229 and previous config saved to /var/cache/conftool/dbconfig/20220207-214335-ladsgroup.json [21:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:52] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [21:46:34] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mc2049.mgmt.codfw.wmnet with reboot policy FORCED [21:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:08] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [21:50:46] 10SRE, 10ops-codfw, 10DBA: x1 codfw master crashed due to faulty DIMM - https://phabricator.wikimedia.org/T300965 (10Kormat) mysqlcheck passed successfully. Starting replication now. [21:55:28] RECOVERY - MariaDB Replica Lag: x1 on db2101 is OK: OK slave_sql_lag Replication lag: 0.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:58:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P20230 and previous config saved to /var/cache/conftool/dbconfig/20220207-215840-ladsgroup.json [21:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:14] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [22:00:04] Reedy and sbassett: #bothumor I ♥ Unicode. All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220207T2200). 
[22:00:23] !log volans@cumin2002 START - Cookbook sre.hosts.provision for host mc2055.mgmt.codfw.wmnet with reboot policy FORCED [22:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:02] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc2049.mgmt.codfw.wmnet with reboot policy FORCED [22:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:52] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mc2050.mgmt.codfw.wmnet with reboot policy FORCED [22:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298554)', diff saved to https://phabricator.wikimedia.org/P20231 and previous config saved to /var/cache/conftool/dbconfig/20220207-220218-ladsgroup.json [22:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:24] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [22:05:48] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [22:08:18] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [22:10:19] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [22:11:55] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc2055.mgmt.codfw.wmnet with reboot 
policy FORCED [22:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T300510)', diff saved to https://phabricator.wikimedia.org/P20232 and previous config saved to /var/cache/conftool/dbconfig/20220207-221345-ladsgroup.json [22:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:50] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [22:14:52] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [22:15:39] (03CR) 10Dzahn: [C: 03+1] miscweb: Remove repeating settings and enable ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/757936 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [22:16:50] (03CR) 10Dzahn: [C: 03+1] Add ingress support to miscweb chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/757935 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [22:17:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc2050.mgmt.codfw.wmnet with reboot policy FORCED [22:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P20233 and previous config saved to /var/cache/conftool/dbconfig/20220207-221723-ladsgroup.json [22:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:56] (03CR) 10Dzahn: [C: 03+1] "Do you want me to fill out the list of gateway host names?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/757935 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [22:21:01] (03CR) 10Bking: [V: 03+1] elasticsearch: new master config (step 3) [puppet] - 10https://gerrit.wikimedia.org/r/736118 (https://phabricator.wikimedia.org/T294805) (owner: 10Ryan Kemper) [22:21:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mc2051.mgmt.codfw.wmnet with reboot policy FORCED [22:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:43] !log begin opensearch upgrade (eqiad) T299168 [22:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:47] T299168: Upgrade OpenSearch - https://phabricator.wikimedia.org/T299168 [22:26:56] (03PS1) 10Ottomata: Remove airflow extras dask and papermill [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/760673 [22:26:58] (03PS1) 10Ottomata: Update changelog [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/760674 [22:31:44] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash1023.eqiad.wmnet, logstash1024.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:32:04] ^^ expected [22:32:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P20234 and previous config saved to /var/cache/conftool/dbconfig/20220207-223228-ladsgroup.json [22:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:58] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [22:33:14] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers 
logstash1023.eqiad.wmnet, logstash1025.eqiad.wmnet, logstash1024.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:35:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc2051.mgmt.codfw.wmnet with reboot policy FORCED [22:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:32] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [22:37:49] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [22:40:52] (03CR) 10Brennen Bearnes: logspam: Read log files more efficiently (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758962 (owner: 10Ahmon Dancy) [22:42:04] (03CR) 10Bking: [V: 03+1 C: 03+2] elasticsearch: new master config (step 3) [puppet] - 10https://gerrit.wikimedia.org/r/736118 (https://phabricator.wikimedia.org/T294805) (owner: 10Ryan Kemper) [22:44:43] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mc2052.mgmt.codfw.wmnet with reboot policy FORCED [22:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:23] !log T294805 puppet-merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/736118 [22:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:31] T294805: Service implementation for elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T294805 [22:46:16] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Update pip-requirements for 2.1.4-py3.7-2 [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/759773 (owner: 10Ottomata) [22:46:25] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Remove airflow extras dask and 
papermill [debs/airflow] (debian) - https://gerrit.wikimedia.org/r/760673 (owner: Ottomata)
[22:46:37] (PS2) Ottomata: Add ipython and Update changelog [debs/airflow] (debian) - https://gerrit.wikimedia.org/r/760674
[22:47:02] (PS3) Ottomata: Add ipython and Update changelog [debs/airflow] (debian) - https://gerrit.wikimedia.org/r/760674
[22:47:12] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:47:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298554)', diff saved to https://phabricator.wikimedia.org/P20235 and previous config saved to /var/cache/conftool/dbconfig/20220207-224733-ladsgroup.json
[22:47:36] (CR) Ottomata: [V: +2 C: +2] Add ipython and Update changelog [debs/airflow] (debian) - https://gerrit.wikimedia.org/r/760674 (owner: Ottomata)
[22:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:47:38] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554
[22:48:16] !log T294805 Disabled puppet across all of elastic1* in preparation for bringing new master hosts in
[22:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:49:08] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[22:53:38] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[22:57:21] !log T294805 Running puppet agent on new master elastic1074.eqiad.wmnet: `sudo enable-puppet "Add new eqiad replacement hosts elastic10[68-83] - T294805 - root" && sudo run-puppet-agent`
[22:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:25] T294805: Service implementation for elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T294805
[22:58:12] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[22:59:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc2052.mgmt.codfw.wmnet with reboot policy FORCED
[22:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:59:52] !log T294805 `sudo systemctl restart elasticsearch_6@production-search-eqiad.service elasticsearch_6@production-search-omega-eqiad.service` on `elastic1074`
[22:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:04:06] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:04:36] !log T294805 Bringing in new master `elastic1081`: `sudo enable-puppet "Add new eqiad replacement hosts elastic10[68-83] - T294805 - root" && sudo run-puppet-agent`
[23:04:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host mc2053.mgmt.codfw.wmnet with reboot policy FORCED
[23:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:04:40] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[23:04:41] T294805: Service implementation for elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T294805
[23:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:04:51] !log T294805 Bringing in new master `elastic1081`: `sudo systemctl restart elasticsearch_6@production-search-eqiad.service elasticsearch_6@production-search-psi-eqiad.service`
[23:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:05:14] https://www.irccloud.com/pastebin/bfKSAW8Z/
[23:06:20] !log T294805 Running puppet and restarting elasticsearch services on `elastic1040` to make it no longer a master
[23:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:10] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[23:09:54] !log T294805 Kicking out the final master `elastic1036` (which is also the currently elected leader); after this we'll be back to 3 masters as intended
[23:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:00] T294805: Service implementation for elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T294805
[23:11:28] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[23:12:28] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:14:00] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:17:03] !log end opensearch upgrade (eqiad) T299168
[23:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:17:08] T299168: Upgrade OpenSearch - https://phabricator.wikimedia.org/T299168
[23:18:32] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[23:19:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mc2053.mgmt.codfw.wmnet with reboot policy FORCED
[23:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:21:01] SRE, ops-codfw, DC-Ops, serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (Papaul)
[23:27:43] !log T294805 Main search cluster all done, proceeding to `omega` cluster
[23:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:48] T294805: Service implementation for elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T294805
[23:27:57] !log T294805 Bringing in new master `elastic1068`
[23:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:29:38] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[23:30:52] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[23:31:16] !log T294805 Bringing in new omega master `elastic1076`
[23:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:35:35] !log T294805 Bringing in new omega master `elastic1057`
[23:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:35:39] T294805: Service implementation for elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T294805
[23:39:22] !log T294805 Removed old masters `elastic1034` and `elastic1038` (and `elastic1040` was removed earlier)
[23:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:38] PROBLEM - Check systemd state on elastic1070 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:45:00] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[23:45:24] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:51:35] (PS1) Bking: elasticsearch: new masters for psi cluster [puppet] - https://gerrit.wikimedia.org/r/760684 (https://phabricator.wikimedia.org/T294805)
[23:53:17] (PS2) Ryan Kemper: elasticsearch: new masters for psi cluster [puppet] - https://gerrit.wikimedia.org/r/760684 (https://phabricator.wikimedia.org/T294805) (owner: Bking)
[23:53:34] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/760684 (https://phabricator.wikimedia.org/T294805) (owner: Bking)