[00:01:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [00:07:46] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT certificaterequests) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-aux - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:12:45] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT certificaterequests) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-aux - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:26:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [00:29:58] PROBLEM - Host restbase1027 is DOWN: PING CRITICAL - Packet loss = 100% [00:36:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [00:38:07] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/952339 [00:38:09] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/952339 (owner: TrainBranchBot) [00:40:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:45:34] (KubernetesAPILatency) resolved: High Kubernetes API 
latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:47:38] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:54:48] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/952339 (owner: TrainBranchBot) [00:58:06] RECOVERY - Host restbase1027 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [01:22:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:27:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:30:38] (CR) Tim Starling: [C: +1] "Do you need me to deploy it?" 
[puppet] - https://gerrit.wikimedia.org/r/952045 (owner: Gergő Tisza) [01:31:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [01:37:11] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [01:39:24] PROBLEM - Host restbase1027 is DOWN: PING CRITICAL - Packet loss = 100% [01:44:52] RECOVERY - Host restbase1027 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [01:57:36] PROBLEM - Host restbase1027 is DOWN: PING CRITICAL - Packet loss = 100% [02:02:01] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [02:08:42] RECOVERY - Host restbase1027 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [02:08:54] (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [02:19:41] (PS1) Tim Starling: Customise $wgSitename on the beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/952563 (https://phabricator.wikimedia.org/T181908) [02:29:18] (PS1) Tim Starling: Raise LoginNotify minimum log level to info [mediawiki-config] - https://gerrit.wikimedia.org/r/952564 (https://phabricator.wikimedia.org/T174200) [02:31:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - 
https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [02:33:54] (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:35:44] PROBLEM - Host restbase1027 is DOWN: PING CRITICAL - Packet loss = 100% [02:46:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [02:46:52] RECOVERY - Host restbase1027 is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms [02:54:08] PROBLEM - Host restbase1027 is DOWN: PING CRITICAL - Packet loss = 100% [03:00:38] RECOVERY - Check systemd state on ms-be1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:01:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [03:10:54] RECOVERY - Host restbase1027 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [03:16:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [03:18:08] PROBLEM - Host restbase1027 is DOWN: PING CRITICAL - Packet loss = 100% [03:21:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [03:23:36] RECOVERY - Host restbase1027 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [03:36:20] PROBLEM - Host restbase1027 is DOWN: PING CRITICAL - Packet loss = 100% 
[03:47:26] RECOVERY - Host restbase1027 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [03:54:38] PROBLEM - Host restbase1027 is DOWN: PING CRITICAL - Packet loss = 100% [04:00:06] RECOVERY - Host restbase1027 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [04:02:14] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [04:04:54] (CR) Gergő Tisza: multi-dc: Fix central autologin URL pattern (1 comment) [puppet] - https://gerrit.wikimedia.org/r/952045 (owner: Gergő Tisza) [04:19:56] PROBLEM - Host restbase1027 is DOWN: PING CRITICAL - Packet loss = 100% [04:25:24] RECOVERY - Host restbase1027 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [04:50:50] PROBLEM - Host restbase1027 is DOWN: PING CRITICAL - Packet loss = 100% [04:58:28] (PS1) KartikMistry: Update cxserver to 2023-08-22-035703-production [deployment-charts] - https://gerrit.wikimedia.org/r/952568 (https://phabricator.wikimedia.org/T343450) [05:07:38] RECOVERY - Host restbase1027 is UP: PING WARNING - Packet loss = 80%, RTA = 0.26 ms [05:12:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P51434 and previous config saved to /var/cache/conftool/dbconfig/20230828-051237-ladsgroup.json [05:13:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance [05:13:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance [05:13:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1192 (T344589)', diff saved to https://phabricator.wikimedia.org/P51435 and previous config saved to /var/cache/conftool/dbconfig/20230828-051349-ladsgroup.json [05:14:58] PROBLEM - Host restbase1027 is DOWN: PING CRITICAL 
- Packet loss = 100% [05:21:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:22:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance [05:22:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance [05:25:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance [05:26:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance [05:26:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance [05:26:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T344589)', diff saved to https://phabricator.wikimedia.org/P51436 and previous config saved to /var/cache/conftool/dbconfig/20230828-052610-ladsgroup.json [05:26:12] !log ladsgroup@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance [05:27:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance [05:27:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance [05:27:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P51437 and previous config saved to /var/cache/conftool/dbconfig/20230828-052742-ladsgroup.json [05:30:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 
day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance [05:30:32] !log depool restbase1027 - a lot of ping down events registered, a check up is needed [05:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance [05:30:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2106 (T344589)', diff saved to https://phabricator.wikimedia.org/P51438 and previous config saved to /var/cache/conftool/dbconfig/20230828-053045-ladsgroup.json [05:31:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T344589)', diff saved to https://phabricator.wikimedia.org/P51439 and previous config saved to /var/cache/conftool/dbconfig/20230828-053108-ladsgroup.json [05:31:48] RECOVERY - Host restbase1027 is UP: PING WARNING - Packet loss = 80%, RTA = 0.25 ms [05:34:13] !log powercycle restbase1027 - stopped publishing metrics days ago, no root tty available in mgmt console [05:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:34] RECOVERY - SSH on restbase1027 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:36:34] PROBLEM - cassandra-c service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:36:42] PROBLEM - cassandra-a service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:36:48] PROBLEM - cassandra-b SSL 10.64.48.185:7000 on restbase1027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [05:36:48] PROBLEM - 
cassandra-c SSL 10.64.48.186:7000 on restbase1027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [05:36:48] PROBLEM - cassandra-a SSL 10.64.48.184:7000 on restbase1027 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [05:36:56] PROBLEM - cassandra-b service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:37:16] RECOVERY - Restbase root url on restbase1027 is OK: HTTP OK: HTTP/1.1 200 - 17735 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/RESTBase [05:39:26] RECOVERY - cassandra-c service on restbase1027 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:39:32] RECOVERY - cassandra-a service on restbase1027 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:39:34] (PS2) Ladsgroup: Stop writing to old extlinks columns in s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/952202 (https://phabricator.wikimedia.org/T342683) [05:39:43] (CR) Ladsgroup: [C: +2] Stop writing to old extlinks columns in s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/952202 (https://phabricator.wikimedia.org/T342683) (owner: Ladsgroup) [05:39:48] RECOVERY - cassandra-b service on restbase1027 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:40:23] (Merged) jenkins-bot: Stop writing to old extlinks columns in s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/952202 (https://phabricator.wikimedia.org/T342683) (owner: Ladsgroup) [05:40:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 
(T344589)', diff saved to https://phabricator.wikimedia.org/P51440 and previous config saved to /var/cache/conftool/dbconfig/20230828-054033-ladsgroup.json [05:40:42] (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/952202 (https://phabricator.wikimedia.org/T342683) (owner: Ladsgroup) [05:41:00] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:952202|Stop writing to old extlinks columns in s4 (T342683)]] [05:41:04] RECOVERY - cassandra-c SSL 10.64.48.186:7000 on restbase1027 is OK: SSL OK - Certificate restbase1027-c valid until 2025-02-21 18:43:55 +0000 (expires in 543 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [05:41:04] RECOVERY - cassandra-b SSL 10.64.48.185:7000 on restbase1027 is OK: SSL OK - Certificate restbase1027-b valid until 2025-02-21 18:43:53 +0000 (expires in 543 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [05:41:06] T342683: Stop writing to the old externallinks columns in beta cluster and production - https://phabricator.wikimedia.org/T342683 [05:41:11] (PS1) Marostegui: wmnet: Failover m5-master [dns] - https://gerrit.wikimedia.org/r/952570 [05:41:29] !log failover m5-master to dbproxy1021 [05:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:41:32] RECOVERY - cassandra-b CQL 10.64.48.185:9042 on restbase1027 is OK: TCP OK - 0.001 second response time on 10.64.48.185 port 9042 https://phabricator.wikimedia.org/T93886 [05:41:32] RECOVERY - cassandra-c CQL 10.64.48.186:9042 on restbase1027 is OK: TCP OK - 0.000 second response time on 10.64.48.186 port 9042 https://phabricator.wikimedia.org/T93886 [05:42:14] (CR) Marostegui: [C: +2] wmnet: Failover m5-master [dns] - https://gerrit.wikimedia.org/r/952570 (owner: Marostegui) [05:42:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db2106 (T344589)', diff saved to https://phabricator.wikimedia.org/P51441 and previous config saved to /var/cache/conftool/dbconfig/20230828-054221-ladsgroup.json [05:42:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P51442 and previous config saved to /var/cache/conftool/dbconfig/20230828-054247-ladsgroup.json [05:43:50] PROBLEM - cassandra-a service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:44:06] PROBLEM - Check systemd state on restbase1027 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:45:16] RECOVERY - cassandra-a service on restbase1027 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:46:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P51443 and previous config saved to /var/cache/conftool/dbconfig/20230828-054615-ladsgroup.json [05:49:38] PROBLEM - cassandra-a service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:49:43] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:952202|Stop writing to old extlinks columns in s4 (T342683)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [05:49:48] T342683: Stop writing to the old externallinks columns in beta cluster and production - https://phabricator.wikimedia.org/T342683 [05:50:45] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [05:50:58] PROBLEM - 
CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [05:51:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:55:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P51444 and previous config saved to /var/cache/conftool/dbconfig/20230828-055539-ladsgroup.json [05:56:36] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:952202|Stop writing to old extlinks columns in s4 (T342683)]] (duration: 15m 36s) [05:56:41] T342683: Stop writing to the old externallinks columns in beta cluster and production - https://phabricator.wikimedia.org/T342683 [05:56:42] SRE, Abstract Wikipedia team, Wikifunctions, serviceops, and 2 others: Wikifunctions functions that call the evaluator are all getting no response, UX instead showing 'http' - https://phabricator.wikimedia.org/T344998 (elukey) Answering to myself - I see in admin_ng that https://gerrit.wikimedia.... 
[05:57:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P51445 and previous config saved to /var/cache/conftool/dbconfig/20230828-055727-ladsgroup.json [05:57:39] (CR) Ayounsi: [C: +1] Allow MGMT_NETWORKS connect to apt server private server on 8080 [puppet] - https://gerrit.wikimedia.org/r/952478 (https://phabricator.wikimedia.org/T336485) (owner: Cathal Mooney) [05:57:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1178 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P51446 and previous config saved to /var/cache/conftool/dbconfig/20230828-055751-ladsgroup.json [05:58:06] SRE-swift-storage, Commons, ConfirmEdit (CAPTCHA extension), Editing-team, and 3 others: Make SwiftFileBackend::doStoreInternal defer the opening of file handles to stay in the concurrency limit - https://phabricator.wikimedia.org/T230245 (Krinkle) [05:58:19] (CR) Ayounsi: [C: +2] Revert "mgmt: allow prometheus" [homer/public] - https://gerrit.wikimedia.org/r/948113 (https://phabricator.wikimedia.org/T335027) (owner: Ayounsi) [05:58:55] (Merged) jenkins-bot: Revert "mgmt: allow prometheus" [homer/public] - https://gerrit.wikimedia.org/r/948113 (https://phabricator.wikimedia.org/T335027) (owner: Ayounsi) [05:59:17] (CR) Ayounsi: [C: +2] Add password support for users [homer/public] - https://gerrit.wikimedia.org/r/948538 (https://phabricator.wikimedia.org/T326322) (owner: Ayounsi) [05:59:34] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [05:59:50] (Merged) jenkins-bot: Add password support for 
users [homer/public] - https://gerrit.wikimedia.org/r/948538 (https://phabricator.wikimedia.org/T326322) (owner: Ayounsi) [06:01:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P51447 and previous config saved to /var/cache/conftool/dbconfig/20230828-060121-ladsgroup.json [06:02:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: Maintenance [06:03:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: Maintenance [06:03:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2103 (T343718)', diff saved to https://phabricator.wikimedia.org/P51448 and previous config saved to /var/cache/conftool/dbconfig/20230828-060317-ladsgroup.json [06:03:22] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [06:10:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P51449 and previous config saved to /var/cache/conftool/dbconfig/20230828-061046-ladsgroup.json [06:12:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P51450 and previous config saved to /var/cache/conftool/dbconfig/20230828-061233-ladsgroup.json [06:15:18] RECOVERY - cassandra-a service on restbase1027 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:15:32] RECOVERY - Check systemd state on restbase1027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:16:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T344589)', diff saved to https://phabricator.wikimedia.org/P51451 and previous config saved 
to /var/cache/conftool/dbconfig/20230828-061627-ladsgroup.json [06:16:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1193.eqiad.wmnet with reason: Maintenance [06:16:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1193.eqiad.wmnet with reason: Maintenance [06:16:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1193 (T344589)', diff saved to https://phabricator.wikimedia.org/P51452 and previous config saved to /var/cache/conftool/dbconfig/20230828-061651-ladsgroup.json [06:19:34] PROBLEM - cassandra-a service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:19:48] PROBLEM - Check systemd state on restbase1027 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:23:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T344589)', diff saved to https://phabricator.wikimedia.org/P51453 and previous config saved to /var/cache/conftool/dbconfig/20230828-062304-ladsgroup.json [06:25:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T344589)', diff saved to https://phabricator.wikimedia.org/P51454 and previous config saved to /var/cache/conftool/dbconfig/20230828-062552-ladsgroup.json [06:25:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance [06:26:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance [06:26:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T344589)', diff saved to https://phabricator.wikimedia.org/P51455 and previous config saved to 
/var/cache/conftool/dbconfig/20230828-062617-ladsgroup.json [06:27:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T344589)', diff saved to https://phabricator.wikimedia.org/P51456 and previous config saved to /var/cache/conftool/dbconfig/20230828-062740-ladsgroup.json [06:27:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance [06:27:58] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:27:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance [06:28:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2110 (T344589)', diff saved to https://phabricator.wikimedia.org/P51457 and previous config saved to /var/cache/conftool/dbconfig/20230828-062805-ladsgroup.json [06:29:30] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:32:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T344589)', diff saved to https://phabricator.wikimedia.org/P51458 and previous config saved to /var/cache/conftool/dbconfig/20230828-063242-ladsgroup.json [06:34:07] (PS1) Muehlenhoff: Update email address and remove account end date [puppet] - https://gerrit.wikimedia.org/r/952655 [06:34:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T344589)', diff saved to https://phabricator.wikimedia.org/P51459 and previous config saved to /var/cache/conftool/dbconfig/20230828-063442-ladsgroup.json [06:34:59] (JobUnavailable) firing: (3) Reduced availability for job cloudlb_haproxy in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:36:41] (CR) Muehlenhoff: [C: +2] Update email address and remove account end date [puppet] - https://gerrit.wikimedia.org/r/952655 (owner: Muehlenhoff) [06:38:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P51460 and previous config saved to /var/cache/conftool/dbconfig/20230828-063810-ladsgroup.json [06:41:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T343718)', diff saved to https://phabricator.wikimedia.org/P51461 and previous config saved to /var/cache/conftool/dbconfig/20230828-064105-ladsgroup.json [06:41:10] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [06:41:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [06:45:28] RECOVERY - cassandra-a service on restbase1027 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:45:44] RECOVERY - Check systemd state on restbase1027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:46:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance [06:46:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance [06:47:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P51462 and previous config saved to 
/var/cache/conftool/dbconfig/20230828-064748-ladsgroup.json [06:49:46] PROBLEM - cassandra-a service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:49:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P51463 and previous config saved to /var/cache/conftool/dbconfig/20230828-064948-ladsgroup.json [06:50:02] PROBLEM - Check systemd state on restbase1027 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:53:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P51464 and previous config saved to /var/cache/conftool/dbconfig/20230828-065316-ladsgroup.json [06:55:34] !log installing nftables bugfix update from bullseye point release [06:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P51465 and previous config saved to /var/cache/conftool/dbconfig/20230828-065611-ladsgroup.json [06:56:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [06:59:13] !log installing perf updates on bullseye hosts [06:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2026.codfw.wmnet with reason: Maintenance [06:59:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2026.codfw.wmnet with reason: Maintenance [06:59:59] 
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2026 (T344589)', diff saved to https://phabricator.wikimedia.org/P51466 and previous config saved to /var/cache/conftool/dbconfig/20230828-065958-ladsgroup.json [07:00:06] Amir1, Urbanecm, and taavi: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230828T0700). Please do the needful. [07:00:06] No Gerrit patches in the queue for this window AFAICS. [07:02:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P51467 and previous config saved to /var/cache/conftool/dbconfig/20230828-070254-ladsgroup.json [07:04:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2026 (T344589)', diff saved to https://phabricator.wikimedia.org/P51468 and previous config saved to /var/cache/conftool/dbconfig/20230828-070453-ladsgroup.json [07:08:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T344589)', diff saved to https://phabricator.wikimedia.org/P51469 and previous config saved to /var/cache/conftool/dbconfig/20230828-070823-ladsgroup.json [07:08:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance [07:08:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance [07:08:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1203 (T344589)', diff saved to https://phabricator.wikimedia.org/P51470 and previous config saved to /var/cache/conftool/dbconfig/20230828-070847-ladsgroup.json [07:11:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast5003.wikimedia.org [07:11:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to 
https://phabricator.wikimedia.org/P51471 and previous config saved to /var/cache/conftool/dbconfig/20230828-071117-ladsgroup.json [07:12:24] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:13:52] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:14:18] RECOVERY - cassandra-a service on restbase1027 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:14:36] RECOVERY - Check systemd state on restbase1027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:15:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T344589)', diff saved to https://phabricator.wikimedia.org/P51472 and previous config saved to /var/cache/conftool/dbconfig/20230828-071503-ladsgroup.json [07:16:16] (PS2) Slyngshede: Allow Unix shell account to be specified. [software/bitu] - https://gerrit.wikimedia.org/r/952402 [07:16:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [07:16:45] Puppet, SRE, Infrastructure-Foundations, netbox, and 3 others: Netbox: use the netbox to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (ayounsi) Looking at the `parents` field. So far we've been defining them manually. First in `netops/monitoring.pp` and now... 
[07:18:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T344589)', diff saved to https://phabricator.wikimedia.org/P51473 and previous config saved to /var/cache/conftool/dbconfig/20230828-071800-ladsgroup.json [07:18:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance [07:18:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance [07:18:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T344589)', diff saved to https://phabricator.wikimedia.org/P51474 and previous config saved to /var/cache/conftool/dbconfig/20230828-071824-ladsgroup.json [07:18:38] (PS1) Slyngshede: WIP: Email on successful signup. [software/bitu] - https://gerrit.wikimedia.org/r/952658 [07:18:38] PROBLEM - cassandra-a service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:18:54] PROBLEM - Check systemd state on restbase1027 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:19:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5003.wikimedia.org [07:19:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2026', diff saved to https://phabricator.wikimedia.org/P51475 and previous config saved to /var/cache/conftool/dbconfig/20230828-071959-ladsgroup.json [07:20:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T344589)', diff saved to https://phabricator.wikimedia.org/P51476 and previous config saved to /var/cache/conftool/dbconfig/20230828-072000-ladsgroup.json [07:20:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 
day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance [07:20:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance [07:20:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2119 (T344589)', diff saved to https://phabricator.wikimedia.org/P51477 and previous config saved to /var/cache/conftool/dbconfig/20230828-072025-ladsgroup.json [07:20:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2002.codfw.wmnet [07:23:09] (PS2) JMeybohm: Re-apply "wikifunctions: Fix networkpolicies" [deployment-charts] - https://gerrit.wikimedia.org/r/952436 (https://phabricator.wikimedia.org/T344177) (owner: Jforrester) [07:23:18] (CR) JMeybohm: Re-apply "wikifunctions: Fix networkpolicies" (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/952436 (https://phabricator.wikimedia.org/T344177) (owner: Jforrester) [07:23:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet [07:24:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance [07:24:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance [07:24:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T343718)', diff saved to https://phabricator.wikimedia.org/P51478 and previous config saved to /var/cache/conftool/dbconfig/20230828-072422-ladsgroup.json [07:24:27] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [07:25:00] SRE, Infrastructure-Foundations, netops: Put Dell SONiC switches in production - https://phabricator.wikimedia.org/T335028 (ayounsi) [07:25:07] (CR) JMeybohm: [C: +2] Re-apply "wikifunctions: Fix networkpolicies" [deployment-charts] - 
https://gerrit.wikimedia.org/r/952436 (https://phabricator.wikimedia.org/T344177) (owner: Jforrester) [07:25:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T344589)', diff saved to https://phabricator.wikimedia.org/P51479 and previous config saved to /var/cache/conftool/dbconfig/20230828-072532-ladsgroup.json [07:25:36] SRE, Infrastructure-Foundations, Observability-Metrics, netops, and 2 others: Prometheus: ingest SONiC metrics - https://phabricator.wikimedia.org/T335027 (ayounsi) Open→Declined Closing in favor of {T344136} and {T326322} [07:26:09] (Merged) jenkins-bot: Re-apply "wikifunctions: Fix networkpolicies" [deployment-charts] - https://gerrit.wikimedia.org/r/952436 (https://phabricator.wikimedia.org/T344177) (owner: Jforrester) [07:26:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T343718)', diff saved to https://phabricator.wikimedia.org/P51480 and previous config saved to /var/cache/conftool/dbconfig/20230828-072623-ladsgroup.json [07:26:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance [07:26:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance [07:26:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T343718)', diff saved to https://phabricator.wikimedia.org/P51481 and previous config saved to /var/cache/conftool/dbconfig/20230828-072644-ladsgroup.json [07:27:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T344589)', diff saved to https://phabricator.wikimedia.org/P51482 and previous config saved to /var/cache/conftool/dbconfig/20230828-072701-ladsgroup.json [07:27:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast6002.wikimedia.org [07:30:05] !log jmm@cumin2002 END (PASS) - 
Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet [07:30:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2002.codfw.wmnet [07:30:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P51483 and previous config saved to /var/cache/conftool/dbconfig/20230828-073009-ladsgroup.json [07:31:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [07:31:37] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2003.codfw.wmnet [07:33:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast6002.wikimedia.org [07:35:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2026', diff saved to https://phabricator.wikimedia.org/P51484 and previous config saved to /var/cache/conftool/dbconfig/20230828-073505-ladsgroup.json [07:35:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet [07:40:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P51485 and previous config saved to /var/cache/conftool/dbconfig/20230828-074038-ladsgroup.json [07:41:28] (PS1) JMeybohm: Fix wikifunctions orchestrator not using the service mesh [deployment-charts] - https://gerrit.wikimedia.org/r/952782 (https://phabricator.wikimedia.org/T344998) [07:41:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2003.codfw.wmnet [07:41:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2003.codfw.wmnet [07:42:08] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P51486 and previous config saved to /var/cache/conftool/dbconfig/20230828-074208-ladsgroup.json [07:44:32] RECOVERY - cassandra-a service on restbase1027 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:44:48] RECOVERY - Check systemd state on restbase1027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:45:10] !log fail over Ganeti master in codfw-test to ganeti-test2003 [07:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P51487 and previous config saved to /var/cache/conftool/dbconfig/20230828-074515-ladsgroup.json [07:48:08] PROBLEM - ganeti-wconfd running on ganeti-test2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [07:48:50] PROBLEM - cassandra-a service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:49:04] PROBLEM - Check systemd state on restbase1027 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:50:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2026 (T344589)', diff saved to https://phabricator.wikimedia.org/P51488 and previous config saved to /var/cache/conftool/dbconfig/20230828-075011-ladsgroup.json [07:50:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2031.codfw.wmnet with reason: Maintenance [07:50:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2031.codfw.wmnet with reason: Maintenance [07:50:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2031 (T344589)', diff saved to https://phabricator.wikimedia.org/P51489 and previous config saved to /var/cache/conftool/dbconfig/20230828-075036-ladsgroup.json [07:54:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2031 (T344589)', diff saved to https://phabricator.wikimedia.org/P51490 and previous config saved to /var/cache/conftool/dbconfig/20230828-075430-ladsgroup.json [07:55:31] (CR) Elukey: [C: +1] Fix wikifunctions orchestrator not using the service mesh [deployment-charts] - https://gerrit.wikimedia.org/r/952782 (https://phabricator.wikimedia.org/T344998) (owner: JMeybohm) [07:55:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P51491 and previous config saved to /var/cache/conftool/dbconfig/20230828-075544-ladsgroup.json [07:56:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [07:57:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P51492 and previous config saved to /var/cache/conftool/dbconfig/20230828-075714-ladsgroup.json [07:57:51] (PS2) JMeybohm: Re-apply "admin_ng: Disable GlobalNetworkPolicy allow rules for wikifunctions" [deployment-charts] - https://gerrit.wikimedia.org/r/952437 (https://phabricator.wikimedia.org/T344177) (owner: Jforrester) [07:58:51] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [07:59:16] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [07:59:22] !log jayme@deploy1002 helmfile [eqiad] START 
helmfile.d/services/wikifunctions: apply [08:00:20] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [08:00:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T344589)', diff saved to https://phabricator.wikimedia.org/P51493 and previous config saved to /var/cache/conftool/dbconfig/20230828-080021-ladsgroup.json [08:00:25] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [08:00:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance [08:00:36] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:00:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance [08:00:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1209 (T344589)', diff saved to https://phabricator.wikimedia.org/P51494 and previous config saved to /var/cache/conftool/dbconfig/20230828-080045-ladsgroup.json [08:01:16] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [08:01:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [08:01:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T343718)', diff saved to https://phabricator.wikimedia.org/P51495 and previous config saved to /var/cache/conftool/dbconfig/20230828-080131-ladsgroup.json [08:01:43] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [08:03:29] (PS1) Filippo Giunchedi: jaeger: use . 
as date separator for storage [deployment-charts] - https://gerrit.wikimedia.org/r/952783 (https://phabricator.wikimedia.org/T344954) [08:03:38] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:03:54] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:04:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T343718)', diff saved to https://phabricator.wikimedia.org/P51496 and previous config saved to /var/cache/conftool/dbconfig/20230828-080407-ladsgroup.json [08:04:56] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:06:01] (PS3) Slyngshede: Allow Unix shell account to be specified. 
[software/bitu] - https://gerrit.wikimedia.org/r/952402 [08:06:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [08:06:43] (CR) Elukey: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/952437 (https://phabricator.wikimedia.org/T344177) (owner: Jforrester) [08:09:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T344589)', diff saved to https://phabricator.wikimedia.org/P51497 and previous config saved to /var/cache/conftool/dbconfig/20230828-080914-ladsgroup.json [08:09:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2031', diff saved to https://phabricator.wikimedia.org/P51498 and previous config saved to /var/cache/conftool/dbconfig/20230828-080936-ladsgroup.json [08:10:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti-test2001.codfw.wmnet [08:10:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T344589)', diff saved to https://phabricator.wikimedia.org/P51499 and previous config saved to /var/cache/conftool/dbconfig/20230828-081051-ladsgroup.json [08:10:52] (CR) Elukey: [C: +1] "After a chat with Janis we realized that the CI's diff is a little messed up, it still shows changes for the allow-uncached-api entry mean" [deployment-charts] - https://gerrit.wikimedia.org/r/952437 (https://phabricator.wikimedia.org/T344177) (owner: Jforrester) [08:10:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance [08:11:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance [08:11:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T344589)', diff saved to 
https://phabricator.wikimedia.org/P51500 and previous config saved to /var/cache/conftool/dbconfig/20230828-081117-ladsgroup.json [08:11:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [08:12:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T344589)', diff saved to https://phabricator.wikimedia.org/P51501 and previous config saved to /var/cache/conftool/dbconfig/20230828-081220-ladsgroup.json [08:12:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance [08:12:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance [08:12:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2136 (T344589)', diff saved to https://phabricator.wikimedia.org/P51502 and previous config saved to /var/cache/conftool/dbconfig/20230828-081245-ladsgroup.json [08:13:14] (CR) JMeybohm: [C: +2] Re-apply "admin_ng: Disable GlobalNetworkPolicy allow rules for wikifunctions" [deployment-charts] - https://gerrit.wikimedia.org/r/952437 (https://phabricator.wikimedia.org/T344177) (owner: Jforrester) [08:13:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2001.codfw.wmnet [08:13:54] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:14:44] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:15:00] RECOVERY - Check systemd state on restbase1027 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:15:04] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:15:28] RECOVERY - cassandra-a service on restbase1027 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:15:33] (Merged) jenkins-bot: Re-apply "admin_ng: Disable GlobalNetworkPolicy allow rules for wikifunctions" [deployment-charts] - https://gerrit.wikimedia.org/r/952437 (https://phabricator.wikimedia.org/T344177) (owner: Jforrester) [08:15:58] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P51503 and previous config saved to /var/cache/conftool/dbconfig/20230828-081637-ladsgroup.json [08:16:56] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [08:17:07] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [08:17:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T344589)', diff saved to https://phabricator.wikimedia.org/P51504 and previous config saved to /var/cache/conftool/dbconfig/20230828-081736-ladsgroup.json [08:18:40] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [08:18:42] PROBLEM - Check systemd state on restbase1027 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:18:52] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [08:18:59] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[08:19:12] PROBLEM - cassandra-a service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:19:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P51505 and previous config saved to /var/cache/conftool/dbconfig/20230828-081913-ladsgroup.json [08:19:21] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [08:19:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2001.codfw.wmnet [08:19:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti-test2001.codfw.wmnet [08:21:00] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:22:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T344589)', diff saved to https://phabricator.wikimedia.org/P51506 and previous config saved to /var/cache/conftool/dbconfig/20230828-082231-ladsgroup.json [08:23:51] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder) [08:24:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P51507 and previous config saved to /var/cache/conftool/dbconfig/20230828-082420-ladsgroup.json [08:24:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2009.codfw.wmnet [08:24:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2031', diff saved to https://phabricator.wikimedia.org/P51508 and previous config saved to /var/cache/conftool/dbconfig/20230828-082443-ladsgroup.json [08:25:00] RECOVERY - Check systemd state 
on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:26:02] PROBLEM - DPKG on ganeti2013 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:26:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [08:31:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2009.codfw.wmnet [08:31:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [08:31:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P51509 and previous config saved to /var/cache/conftool/dbconfig/20230828-083143-ladsgroup.json [08:32:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P51510 and previous config saved to /var/cache/conftool/dbconfig/20230828-083242-ladsgroup.json [08:34:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P51511 and previous config saved to /var/cache/conftool/dbconfig/20230828-083420-ladsgroup.json [08:34:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2009.codfw.wmnet [08:35:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2009.codfw.wmnet [08:35:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2010.codfw.wmnet [08:37:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to 
https://phabricator.wikimedia.org/P51512 and previous config saved to /var/cache/conftool/dbconfig/20230828-083737-ladsgroup.json [08:37:58] (PS1) Jelto: miscweb: update bugzilla image to use gzip again [deployment-charts] - https://gerrit.wikimedia.org/r/952807 (https://phabricator.wikimedia.org/T343914) [08:38:18] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:39:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P51513 and previous config saved to /var/cache/conftool/dbconfig/20230828-083926-ladsgroup.json [08:39:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2031 (T344589)', diff saved to https://phabricator.wikimedia.org/P51514 and previous config saved to /var/cache/conftool/dbconfig/20230828-083949-ladsgroup.json [08:41:07] (PS1) Urbanecm: Revert "ltwiki: Disable Growth features" [mediawiki-config] - https://gerrit.wikimedia.org/r/952808 (https://phabricator.wikimedia.org/T344013) [08:41:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2010.codfw.wmnet [08:44:04] PROBLEM - Host kubestagetcd2002 is DOWN: PING CRITICAL - Packet loss = 100% [08:45:18] RECOVERY - Check systemd state on restbase1027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:45:28] RECOVERY - Host kubestagetcd2002 is UP: PING OK - Packet loss = 0%, RTA = 33.54 ms [08:45:41] (CR) Elukey: [C: +1] "10MB (more than that with metadata) is really a lot, but resource-wise kafka jumbo should handle it fine (10G nics, plenty of space on dis" [puppet] - https://gerrit.wikimedia.org/r/952160 (https://phabricator.wikimedia.org/T307959) (owner: Btullis) [08:46:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1128 (T343718)', diff saved to https://phabricator.wikimedia.org/P51515 and previous config saved to /var/cache/conftool/dbconfig/20230828-084650-ladsgroup.json [08:46:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance [08:46:55] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [08:47:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance [08:47:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2010.codfw.wmnet [08:47:11] (PS2) Elukey: changeprop: allow retries for liftwing streams with 502 responses [deployment-charts] - https://gerrit.wikimedia.org/r/948136 [08:47:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T343718)', diff saved to https://phabricator.wikimedia.org/P51516 and previous config saved to /var/cache/conftool/dbconfig/20230828-084710-ladsgroup.json [08:47:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2010.codfw.wmnet [08:47:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P51517 and previous config saved to /var/cache/conftool/dbconfig/20230828-084748-ladsgroup.json [08:48:42] RECOVERY - cassandra-a service on restbase1027 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:48:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2011.codfw.wmnet [08:49:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T343718)', diff saved to https://phabricator.wikimedia.org/P51518 and previous config saved to /var/cache/conftool/dbconfig/20230828-084926-ladsgroup.json [08:49:28] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance [08:49:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance [08:49:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2130 (T343718)', diff saved to https://phabricator.wikimedia.org/P51519 and previous config saved to /var/cache/conftool/dbconfig/20230828-084947-ladsgroup.json [08:52:22] PROBLEM - Check systemd state on restbase1027 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:52:38] (03PS12) 10Filippo Giunchedi: webperf,arclamp: Rename to clarify as separate roles [puppet] - 10https://gerrit.wikimedia.org/r/935523 (owner: 10Krinkle) [08:52:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2011.codfw.wmnet [08:52:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P51520 and previous config saved to /var/cache/conftool/dbconfig/20230828-085243-ladsgroup.json [08:52:52] (03CR) 10Filippo Giunchedi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/935523 (owner: 10Krinkle) [08:52:56] PROBLEM - cassandra-a service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:53:48] RECOVERY - Check systemd state on restbase1027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:54:14] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: remove obsolete gerrit checks [puppet] - 10https://gerrit.wikimedia.org/r/948552 (owner: 10Filippo Giunchedi) [08:54:22] RECOVERY - cassandra-a service on restbase1027 is OK: OK - cassandra-a is active 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:54:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T344589)', diff saved to https://phabricator.wikimedia.org/P51521 and previous config saved to /var/cache/conftool/dbconfig/20230828-085432-ladsgroup.json [08:54:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance [08:54:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance [08:54:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1211 (T344589)', diff saved to https://phabricator.wikimedia.org/P51522 and previous config saved to /var/cache/conftool/dbconfig/20230828-085456-ladsgroup.json [08:55:27] (03CR) 10Filippo Giunchedi: [C: 03+2] webperf,arclamp: Rename to clarify as separate roles [puppet] - 10https://gerrit.wikimedia.org/r/935523 (owner: 10Krinkle) [08:56:28] RECOVERY - DPKG on ganeti2013 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:57:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2029.codfw.wmnet with reason: Maintenance [08:57:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2029.codfw.wmnet with reason: Maintenance [08:57:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2029 (T344589)', diff saved to https://phabricator.wikimedia.org/P51523 and previous config saved to /var/cache/conftool/dbconfig/20230828-085737-ladsgroup.json [08:57:56] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:58:02] PROBLEM - Check systemd state on restbase1027 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:58:38] PROBLEM - cassandra-a service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:59:17] (03CR) 10Filippo Giunchedi: "This change is ready for review." [labs/private] - 10https://gerrit.wikimedia.org/r/950052 (owner: 10Krinkle) [08:59:23] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] webperf: Remove processors_and_site.yaml [labs/private] - 10https://gerrit.wikimedia.org/r/950052 (owner: 10Krinkle) [08:59:51] (03CR) 10Filippo Giunchedi: [C: 03+2] "All merged, I've also cleaned up private.git in production and its corresponding file for public private.git https://gerrit.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/935523 (owner: 10Krinkle) [09:00:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2011.codfw.wmnet [09:00:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2011.codfw.wmnet [09:01:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T344589)', diff saved to https://phabricator.wikimedia.org/P51524 and previous config saved to /var/cache/conftool/dbconfig/20230828-090107-ladsgroup.json [09:02:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2029 (T344589)', diff saved to https://phabricator.wikimedia.org/P51525 and previous config saved to /var/cache/conftool/dbconfig/20230828-090229-ladsgroup.json [09:02:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2012.codfw.wmnet [09:02:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T344589)', diff saved to https://phabricator.wikimedia.org/P51526 and previous config saved to /var/cache/conftool/dbconfig/20230828-090255-ladsgroup.json [09:02:59] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance [09:03:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance [09:03:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T344589)', diff saved to https://phabricator.wikimedia.org/P51527 and previous config saved to /var/cache/conftool/dbconfig/20230828-090318-ladsgroup.json [09:05:19] (03CR) 10Muehlenhoff: [C: 03+2] profile::ci::package_builder::extra_packages: Remove Stretch support [puppet] - 10https://gerrit.wikimedia.org/r/952302 (owner: 10Muehlenhoff) [09:06:46] Krinkle, godog: FYI, I'm merging the labs-private patch which is related to your ongoing webperf role rename [09:07:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T344589)', diff saved to https://phabricator.wikimedia.org/P51528 and previous config saved to /var/cache/conftool/dbconfig/20230828-090749-ladsgroup.json [09:08:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [09:08:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [09:08:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2147 (T344589)', diff saved to https://phabricator.wikimedia.org/P51529 and previous config saved to /var/cache/conftool/dbconfig/20230828-090819-ladsgroup.json [09:11:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T344589)', diff saved to https://phabricator.wikimedia.org/P51530 and previous config saved to /var/cache/conftool/dbconfig/20230828-091140-ladsgroup.json [09:14:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T344589)', diff saved to 
https://phabricator.wikimedia.org/P51531 and previous config saved to /var/cache/conftool/dbconfig/20230828-091446-ladsgroup.json [09:15:51] RECOVERY - cassandra-a service on restbase1027 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:16:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P51532 and previous config saved to /var/cache/conftool/dbconfig/20230828-091613-ladsgroup.json [09:17:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2029', diff saved to https://phabricator.wikimedia.org/P51533 and previous config saved to /var/cache/conftool/dbconfig/20230828-091735-ladsgroup.json [09:18:01] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [09:19:19] moritzm: ack! thank you, I keep forgetting about the extra merge heh [09:19:31] PROBLEM - cassandra-a service on restbase1027 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:25:53] (03PS1) 10Clément Goubert: mediawiki: Switch to set worker envoy threads [deployment-charts] - 10https://gerrit.wikimedia.org/r/952812 (https://phabricator.wikimedia.org/T344814) [09:26:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [09:26:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P51534 and previous config saved to /var/cache/conftool/dbconfig/20230828-092646-ladsgroup.json [09:27:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T343718)', diff saved to https://phabricator.wikimedia.org/P51535 
and previous config saved to /var/cache/conftool/dbconfig/20230828-092713-ladsgroup.json [09:27:19] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [09:28:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T343718)', diff saved to https://phabricator.wikimedia.org/P51536 and previous config saved to /var/cache/conftool/dbconfig/20230828-092847-ladsgroup.json [09:29:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P51537 and previous config saved to /var/cache/conftool/dbconfig/20230828-092952-ladsgroup.json [09:30:47] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host contint1002.wikimedia.org [09:31:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P51538 and previous config saved to /var/cache/conftool/dbconfig/20230828-093120-ladsgroup.json [09:32:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2029', diff saved to https://phabricator.wikimedia.org/P51539 and previous config saved to /var/cache/conftool/dbconfig/20230828-093242-ladsgroup.json [09:37:47] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host contint1002.wikimedia.org [09:39:07] !log begin rebooting lvs hosts in codfw (T344587) [09:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:13] (03PS1) 10David Caro: p:grafana: pass the envoyproxy::enable to autorestart [puppet] - 10https://gerrit.wikimedia.org/r/952813 (https://phabricator.wikimedia.org/T345060) [09:41:02] (03CR) 10Muehlenhoff: "Looks good in general, a few comments inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/952402 (owner: 10Slyngshede) [09:41:04] !log ignore previous message: s/codfw/ulsfo/ [09:41:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2012.codfw.wmnet [09:41:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P51540 and previous config saved to /var/cache/conftool/dbconfig/20230828-094152-ladsgroup.json [09:41:57] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:42:08] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs4010.ulsfo.wmnet [09:42:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P51541 and previous config saved to /var/cache/conftool/dbconfig/20230828-094220-ladsgroup.json [09:42:31] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:43:17] PROBLEM - Host kubetcd2005 is DOWN: PING CRITICAL - Packet loss = 100% [09:43:21] (03CR) 10Sergio Gimeno: [C: 03+1] Revert "ltwiki: Disable Growth features" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952808 (https://phabricator.wikimedia.org/T344013) (owner: 10Urbanecm) [09:43:35] PROBLEM - Host ml-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100% [09:43:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P51542 and previous config saved to /var/cache/conftool/dbconfig/20230828-094353-ladsgroup.json [09:44:56] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs4010.ulsfo.wmnet [09:44:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P51543 and previous config saved 
to /var/cache/conftool/dbconfig/20230828-094458-ladsgroup.json [09:45:05] (03PS2) 10David Caro: p:grafana: pass the envoyproxy::enable to autorestart [puppet] - 10https://gerrit.wikimedia.org/r/952813 (https://phabricator.wikimedia.org/T345060) [09:45:41] RECOVERY - Host kubetcd2005 is UP: PING OK - Packet loss = 0%, RTA = 33.56 ms [09:45:47] RECOVERY - Host ml-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 34.61 ms [09:46:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T344589)', diff saved to https://phabricator.wikimedia.org/P51544 and previous config saved to /var/cache/conftool/dbconfig/20230828-094626-ladsgroup.json [09:46:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [09:46:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance [09:46:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance [09:46:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1214 (T344589)', diff saved to https://phabricator.wikimedia.org/P51545 and previous config saved to /var/cache/conftool/dbconfig/20230828-094650-ladsgroup.json [09:46:55] (03PS3) 10David Caro: p:grafana: pass the envoyproxy::enable to autorestart [puppet] - 10https://gerrit.wikimedia.org/r/952813 (https://phabricator.wikimedia.org/T345060) [09:47:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2012.codfw.wmnet [09:47:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2012.codfw.wmnet [09:47:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2013.codfw.wmnet [09:47:49] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance es2029 (T344589)', diff saved to https://phabricator.wikimedia.org/P51546 and previous config saved to /var/cache/conftool/dbconfig/20230828-094748-ladsgroup.json [09:47:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2027.codfw.wmnet with reason: Maintenance [09:48:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2027.codfw.wmnet with reason: Maintenance [09:48:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2027 (T344589)', diff saved to https://phabricator.wikimedia.org/P51547 and previous config saved to /var/cache/conftool/dbconfig/20230828-094813-ladsgroup.json [09:48:17] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43029/console" [puppet] - 10https://gerrit.wikimedia.org/r/952813 (https://phabricator.wikimedia.org/T345060) (owner: 10David Caro) [09:50:17] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:51:05] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:52:28] Ah [09:52:56] yeah that's normal [09:52:59] It's depooled [09:53:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2027 (T344589)', diff saved to https://phabricator.wikimedia.org/P51548 and previous config saved to /var/cache/conftool/dbconfig/20230828-095308-ladsgroup.json [09:53:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T344589)', diff saved to 
https://phabricator.wikimedia.org/P51549 and previous config saved to /var/cache/conftool/dbconfig/20230828-095308-ladsgroup.json [09:53:55] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:54:29] !log disable puppet and stop pybal on lvs4008 for reboot (T344587) [09:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:55:23] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:56:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2013.codfw.wmnet [09:56:19] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:56:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T344589)', diff saved to https://phabricator.wikimedia.org/P51550 and previous config saved to /var/cache/conftool/dbconfig/20230828-095658-ladsgroup.json [09:57:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance [09:57:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance [09:57:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T344589)', diff saved to 
https://phabricator.wikimedia.org/P51551 and previous config saved to /var/cache/conftool/dbconfig/20230828-095722-ladsgroup.json [09:57:27] (03PS1) 10Fabfur: admin: add some dotfiles for user fabfur [puppet] - 10https://gerrit.wikimedia.org/r/952816 [09:57:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P51552 and previous config saved to /var/cache/conftool/dbconfig/20230828-095732-ladsgroup.json [09:58:11] fabfur: I suppose the BGP errors above are because of lvs4008 reboot [09:58:15] PROBLEM - pybal on lvs4008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [09:59:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P51553 and previous config saved to /var/cache/conftool/dbconfig/20230828-095859-ladsgroup.json [09:59:03] PROBLEM - PyBal backends health check on lvs4008 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [09:59:13] claime: correct, didn't find a way to silence those [09:59:21] no problem, just double-checking [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230828T1000) [10:00:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T344589)', diff saved to https://phabricator.wikimedia.org/P51554 and previous config saved to /var/cache/conftool/dbconfig/20230828-100005-ladsgroup.json [10:00:09] (03PS1) 10Caenus: Deleting Ns:104 in itwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952817 (https://phabricator.wikimedia.org/T298315) [10:00:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
[10:00:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance [10:00:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [10:00:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [10:00:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2155 (T344589)', diff saved to https://phabricator.wikimedia.org/P51555 and previous config saved to /var/cache/conftool/dbconfig/20230828-100045-ladsgroup.json [10:01:09] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:02:27] PROBLEM - PyBal connections to etcd on lvs4008 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [10:02:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2013.codfw.wmnet [10:02:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2013.codfw.wmnet [10:02:35] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:04:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T344589)', diff saved to https://phabricator.wikimedia.org/P51556 and previous config saved to /var/cache/conftool/dbconfig/20230828-100443-ladsgroup.json [10:05:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:05:41] (03CR) 10Elukey: [C: 03+2] changeprop: allow retries for liftwing streams with 502 responses [deployment-charts] - 10https://gerrit.wikimedia.org/r/948136 (owner: 10Elukey) [10:06:38] (03CR) 10JMeybohm: [C: 03+1] mediawiki: Switch to set worker envoy threads [deployment-charts] - 10https://gerrit.wikimedia.org/r/952812 (https://phabricator.wikimedia.org/T344814) (owner: 10Clément Goubert) [10:07:12] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Switch to set worker envoy threads [deployment-charts] - 10https://gerrit.wikimedia.org/r/952812 (https://phabricator.wikimedia.org/T344814) (owner: 10Clément Goubert) [10:07:45] (03CR) 10FNegri: p:grafana: pass the envoyproxy::enable to autorestart (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952813 (https://phabricator.wikimedia.org/T345060) (owner: 10David Caro) [10:08:10] (03Merged) 10jenkins-bot: mediawiki: Switch to set worker envoy threads [deployment-charts] - 10https://gerrit.wikimedia.org/r/952812 (https://phabricator.wikimedia.org/T344814) (owner: 10Clément Goubert) [10:08:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2027', diff saved to https://phabricator.wikimedia.org/P51557 and previous config saved to /var/cache/conftool/dbconfig/20230828-100814-ladsgroup.json [10:08:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P51558 and previous config saved to /var/cache/conftool/dbconfig/20230828-100814-ladsgroup.json [10:08:20] (03CR) 10Jelto: [C: 03+2] miscweb: update bugzilla image to use gzip again [deployment-charts] - 
10https://gerrit.wikimedia.org/r/952807 (https://phabricator.wikimedia.org/T343914) (owner: 10Jelto) [10:08:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T344589)', diff saved to https://phabricator.wikimedia.org/P51559 and previous config saved to /var/cache/conftool/dbconfig/20230828-100823-ladsgroup.json [10:09:09] (03Merged) 10jenkins-bot: miscweb: update bugzilla image to use gzip again [deployment-charts] - 10https://gerrit.wikimedia.org/r/952807 (https://phabricator.wikimedia.org/T343914) (owner: 10Jelto) [10:09:58] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync [10:10:10] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [10:10:21] !log Deploying 952812 for T344814 to mw-debug and mw-api-ext [10:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:26] T344814: mw-on-k8s tls-proxy container CPU throttling at low average load - https://phabricator.wikimedia.org/T344814 [10:10:41] (03PS2) 10Fabfur: admin: add some dotfiles for user fabfur [puppet] - 10https://gerrit.wikimedia.org/r/952816 [10:10:46] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [10:11:07] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: sync [10:11:24] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [10:11:39] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [10:11:51] (03CR) 10Fabfur: [C: 03+2] admin: add some dotfiles for user fabfur [puppet] - 10https://gerrit.wikimedia.org/r/952816 (owner: 10Fabfur) [10:12:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T343718)', diff saved to https://phabricator.wikimedia.org/P51560 and previous config saved to /var/cache/conftool/dbconfig/20230828-101238-ladsgroup.json [10:12:41] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [10:12:44] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [10:12:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [10:13:59] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs4008.ulsfo.wmnet [10:14:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T343718)', diff saved to https://phabricator.wikimedia.org/P51561 and previous config saved to /var/cache/conftool/dbconfig/20230828-101405-ladsgroup.json [10:14:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance [10:14:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance [10:14:21] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [10:14:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T343718)', diff saved to https://phabricator.wikimedia.org/P51562 and previous config saved to /var/cache/conftool/dbconfig/20230828-101426-ladsgroup.json [10:14:35] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:15:12] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [10:16:00] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2015.codfw.wmnet [10:16:53] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs4008.ulsfo.wmnet [10:17:40] !log enable puppet and start pybal on lvs4008 for reboot (T344587) [10:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:17:45] RECOVERY - PyBal backends health check on lvs4008 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:18:21] RECOVERY - pybal on lvs4008 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [10:18:43] RECOVERY - PyBal connections to etcd on lvs4008 is OK: OK: 12 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [10:19:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P51563 and previous config saved to /var/cache/conftool/dbconfig/20230828-101949-ladsgroup.json [10:20:49] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [10:21:23] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:21:48] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [10:22:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2015.codfw.wmnet [10:22:49] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:23:06] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [10:23:20] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [10:23:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2027', diff saved to https://phabricator.wikimedia.org/P51564 and previous config saved to /var/cache/conftool/dbconfig/20230828-102320-ladsgroup.json [10:23:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P51565 and previous config 
saved to /var/cache/conftool/dbconfig/20230828-102320-ladsgroup.json [10:23:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P51566 and previous config saved to /var/cache/conftool/dbconfig/20230828-102330-ladsgroup.json [10:23:49] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:25:13] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:28:38] (03PS1) 10Kamila Součková: httpd-fcgi: Increase logger packet size [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952822 (https://phabricator.wikimedia.org/T344991) [10:29:55] !log jelto@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [10:30:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2015.codfw.wmnet [10:30:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2015.codfw.wmnet [10:31:01] !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [10:34:24] (03PS1) 10Clément Goubert: mediawiki: Global values override cpu limit removal [deployment-charts] - 10https://gerrit.wikimedia.org/r/952823 (https://phabricator.wikimedia.org/T344814) [10:34:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P51567 and previous config saved to /var/cache/conftool/dbconfig/20230828-103456-ladsgroup.json [10:35:15] PROBLEM - Check systemd state on ms-be1069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:35:38] 
!log jelto@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [10:36:26] (03CR) 10JMeybohm: [C: 03+1] mediawiki: Global values override cpu limit removal [deployment-charts] - 10https://gerrit.wikimedia.org/r/952823 (https://phabricator.wikimedia.org/T344814) (owner: 10Clément Goubert) [10:37:18] !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [10:38:08] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Global values override cpu limit removal [deployment-charts] - 10https://gerrit.wikimedia.org/r/952823 (https://phabricator.wikimedia.org/T344814) (owner: 10Clément Goubert) [10:38:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2027 (T344589)', diff saved to https://phabricator.wikimedia.org/P51568 and previous config saved to /var/cache/conftool/dbconfig/20230828-103826-ladsgroup.json [10:38:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T344589)', diff saved to https://phabricator.wikimedia.org/P51569 and previous config saved to /var/cache/conftool/dbconfig/20230828-103827-ladsgroup.json [10:38:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [10:38:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P51570 and previous config saved to /var/cache/conftool/dbconfig/20230828-103836-ladsgroup.json [10:38:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [10:38:54] (JobUnavailable) firing: (3) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:39:07] (03Merged) 
10jenkins-bot: mediawiki: Global values override cpu limit removal [deployment-charts] - 10https://gerrit.wikimedia.org/r/952823 (https://phabricator.wikimedia.org/T344814) (owner: 10Clément Goubert) [10:39:27] (03CR) 10Clément Goubert: [C: 03+1] httpd-fcgi: Increase logger packet size [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952822 (https://phabricator.wikimedia.org/T344991) (owner: 10Kamila Součková) [10:39:50] !log elukey@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad [10:41:02] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [10:41:28] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [10:42:20] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [10:42:43] (03CR) 10Kamila Součková: [C: 03+2] httpd-fcgi: Increase logger packet size [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952822 (https://phabricator.wikimedia.org/T344991) (owner: 10Kamila Součková) [10:42:49] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [10:42:58] (03CR) 10Kamila Součková: [V: 03+2 C: 03+2] httpd-fcgi: Increase logger packet size [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952822 (https://phabricator.wikimedia.org/T344991) (owner: 10Kamila Součková) [10:43:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [10:44:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [10:44:05] !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [10:44:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2145 (T343718)', diff saved to https://phabricator.wikimedia.org/P51571 and previous config saved to 
/var/cache/conftool/dbconfig/20230828-104407-ladsgroup.json [10:44:12] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [10:46:10] (03CR) 10Zoranzoki21: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952817 (https://phabricator.wikimedia.org/T298315) (owner: 10Caenus) [10:46:11] !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [10:46:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2016.codfw.wmnet [10:48:35] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:49:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2016.codfw.wmnet [10:50:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T344589)', diff saved to https://phabricator.wikimedia.org/P51572 and previous config saved to /var/cache/conftool/dbconfig/20230828-105002-ladsgroup.json [10:50:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [10:50:15] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): reimage puppetmasteres to puppetserveres - https://phabricator.wikimedia.org/T345067 (10jbond) [10:50:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [10:50:25] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): reimage puppetmasteres to puppetserveres - https://phabricator.wikimedia.org/T345067 (10jbond) p:05Triage→03Medium [10:50:48] !log installing exim4 bugfix updates from Bookworm point 
release [10:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:03] jouncebot: nowandnext [10:51:03] For the next 0 hour(s) and 8 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230828T1000) [10:51:04] In 2 hour(s) and 8 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230828T1300) [10:51:47] PROBLEM - Host kubestagetcd2003 is DOWN: PING CRITICAL - Packet loss = 100% [10:51:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T343718)', diff saved to https://phabricator.wikimedia.org/P51573 and previous config saved to /var/cache/conftool/dbconfig/20230828-105153-ladsgroup.json [10:51:57] PROBLEM - Host ml-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100% [10:52:00] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [10:52:55] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:53:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T344589)', diff saved to https://phabricator.wikimedia.org/P51574 and previous config saved to /var/cache/conftool/dbconfig/20230828-105342-ladsgroup.json [10:53:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [10:54:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [10:54:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2172 (T344589)', diff saved to https://phabricator.wikimedia.org/P51575 and previous config saved to /var/cache/conftool/dbconfig/20230828-105407-ladsgroup.json [10:54:31] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Integrate Bookworm 12.1 
point update - https://phabricator.wikimedia.org/T343121 (10MoritzMuehlenhoff) [10:54:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [10:54:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [10:55:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1190 (T344589)', diff saved to https://phabricator.wikimedia.org/P51576 and previous config saved to /var/cache/conftool/dbconfig/20230828-105503-ladsgroup.json [10:55:09] !log disable puppet and stop pybal on lvs4009 for reboot (T344587) [10:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:29] RECOVERY - Host kubestagetcd2003 is UP: PING OK - Packet loss = 0%, RTA = 33.37 ms [10:55:51] RECOVERY - Host ml-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 33.60 ms [10:56:57] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:57:21] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:57:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2016.codfw.wmnet [10:57:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2016.codfw.wmnet [10:57:56] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on restbase1027.eqiad.wmnet with reason: T345058 - service probes flapping [10:58:02] T345058: Cassandra instance with corrupted commit log after powercycle of restbase1027 - https://phabricator.wikimedia.org/T345058 [10:58:09] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on restbase1027.eqiad.wmnet with reason: 
T345058 - service probes flapping [10:59:05] PROBLEM - pybal on lvs4009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [10:59:21] PROBLEM - PyBal backends health check on lvs4009 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [10:59:43] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:00:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T344589)', diff saved to https://phabricator.wikimedia.org/P51577 and previous config saved to /var/cache/conftool/dbconfig/20230828-110038-ladsgroup.json [11:01:07] PROBLEM - PyBal connections to etcd on lvs4009 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [11:01:09] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T344589)', diff saved to https://phabricator.wikimedia.org/P51578 and previous config saved to /var/cache/conftool/dbconfig/20230828-110124-ladsgroup.json [11:03:30] (03PS1) 10Fabfur: admin: fabfur's dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/952826 [11:04:28] (03CR) 10Fabfur: [C: 03+2] admin: fabfur's dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/952826 (owner: 10Fabfur) [11:05:43] !log kamila@deploy1002 Started scap: base image update due to T344991 [11:05:48] T344991: Apache logs get split across packets in MW-on-K8s - https://phabricator.wikimedia.org/T344991 [11:06:44] 10SRE-tools, 10Cloud-VPS, 10Spicerack: Extend "test-cookbook" to 
support wmcs-cookbooks - https://phabricator.wikimedia.org/T345069 (10fnegri) [11:07:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P51579 and previous config saved to /var/cache/conftool/dbconfig/20230828-110700-ladsgroup.json [11:07:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2017.codfw.wmnet [11:07:03] 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Spicerack: Extend "test-cookbook" to support wmcs-cookbooks - https://phabricator.wikimedia.org/T345069 (10fnegri) p:05Triage→03Low [11:07:25] PROBLEM - Check systemd state on ml-serve1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:27] PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:11:01] !log bounce ferm on ml-serve10001 [11:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:39] RECOVERY - Check systemd state on ml-serve1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:14] !log kamila@deploy1002 Finished scap: base image update due to T344991 (duration: 09m 31s) [11:15:19] T344991: Apache logs get split across packets in MW-on-K8s - https://phabricator.wikimedia.org/T344991 [11:15:33] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs4009.ulsfo.wmnet [11:15:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2017.codfw.wmnet [11:15:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P51580 and previous 
config saved to /var/cache/conftool/dbconfig/20230828-111544-ladsgroup.json [11:16:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P51581 and previous config saved to /var/cache/conftool/dbconfig/20230828-111630-ladsgroup.json [11:18:21] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs4009.ulsfo.wmnet [11:18:23] PROBLEM - Host lvs4009 is DOWN: PING CRITICAL - Packet loss = 100% [11:18:35] RECOVERY - Host lvs4009 is UP: PING OK - Packet loss = 0%, RTA = 70.88 ms [11:19:09] PROBLEM - pybal on lvs4009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [11:19:25] PROBLEM - PyBal backends health check on lvs4009 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [11:19:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T343718)', diff saved to https://phabricator.wikimedia.org/P51582 and previous config saved to /var/cache/conftool/dbconfig/20230828-111949-ladsgroup.json [11:19:55] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [11:20:05] !log enable puppet and start pybal on lvs4009 (T344587) [11:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:35] RECOVERY - pybal on lvs4009 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [11:20:51] RECOVERY - PyBal backends health check on lvs4009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:21:31] RECOVERY - PyBal connections to etcd on lvs4009 is OK: OK: 4 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [11:22:06] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P51583 and previous config saved to /var/cache/conftool/dbconfig/20230828-112206-ladsgroup.json [11:23:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2017.codfw.wmnet [11:23:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2017.codfw.wmnet [11:23:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:27:29] (03CR) 10CI reject: [V: 04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/952830 (owner: 10L10n-bot) [11:28:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:29:13] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2018.codfw.wmnet [11:30:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P51584 and previous config saved to /var/cache/conftool/dbconfig/20230828-113050-ladsgroup.json [11:31:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P51585 and previous config saved to /var/cache/conftool/dbconfig/20230828-113136-ladsgroup.json [11:33:01] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: 
wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:33:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2018.codfw.wmnet [11:34:27] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:34:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P51587 and previous config saved to /var/cache/conftool/dbconfig/20230828-113455-ladsgroup.json [11:37:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T343718)', diff saved to https://phabricator.wikimedia.org/P51588 and previous config saved to /var/cache/conftool/dbconfig/20230828-113712-ladsgroup.json [11:37:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance [11:37:18] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [11:37:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance [11:37:31] (03CR) 10Ayounsi: [C: 03+2] Allow gNMI from netflow hosts and to Juniper devices [homer/public] - 10https://gerrit.wikimedia.org/r/948535 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [11:37:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T343718)', diff saved to https://phabricator.wikimedia.org/P51589 and previous config saved to /var/cache/conftool/dbconfig/20230828-113733-ladsgroup.json [11:38:04] (03Merged) 10jenkins-bot: Allow gNMI from netflow hosts and to Juniper devices [homer/public] - 10https://gerrit.wikimedia.org/r/948535 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [11:38:54] (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy 
in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:40:53] RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:41:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2018.codfw.wmnet [11:41:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2018.codfw.wmnet [11:44:01] !log jbond@cumin1001 START - Cookbook sre.hosts.decommission for hosts puppetmaster1006.eqiad.wmnet [11:44:30] (03PS1) 10Filippo Giunchedi: aptrepo: update apt.grafana.com key [puppet] - 10https://gerrit.wikimedia.org/r/952841 [11:45:20] (03PS1) 10Stevemunene: datahub: Add the oidc scope [deployment-charts] - 10https://gerrit.wikimedia.org/r/952842 (https://phabricator.wikimedia.org/T305874) [11:45:53] (03CR) 10CI reject: [V: 04-1] datahub: Add the oidc scope [deployment-charts] - 10https://gerrit.wikimedia.org/r/952842 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [11:45:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T344589)', diff saved to https://phabricator.wikimedia.org/P51590 and previous config saved to /var/cache/conftool/dbconfig/20230828-114556-ladsgroup.json [11:46:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T344589)', diff saved to https://phabricator.wikimedia.org/P51591 and previous config saved to /var/cache/conftool/dbconfig/20230828-114642-ladsgroup.json [11:46:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance [11:47:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance [11:47:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1199 (T344589)', diff saved to https://phabricator.wikimedia.org/P51592 and previous config saved to /var/cache/conftool/dbconfig/20230828-114706-ladsgroup.json [11:48:10] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/952841 (owner: 10Filippo Giunchedi) [11:48:54] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/952841 (owner: 10Filippo Giunchedi) [11:48:54] (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:49:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet [11:50:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P51593 and previous config saved to /var/cache/conftool/dbconfig/20230828-115003-ladsgroup.json [11:50:21] (03PS1) 10Urbanecm: linkrecommendation: Add dbproxy1023 to network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/952843 (https://phabricator.wikimedia.org/T340780) [11:50:31] !log jbond@cumin1001 START - Cookbook sre.dns.netbox [11:51:01] (03CR) 10CI reject: [V: 04-1] linkrecommendation: Add dbproxy1023 to network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/952843 (https://phabricator.wikimedia.org/T340780) (owner: 10Urbanecm) [11:51:11] 10SRE, 10Add-Link, 10GrowthExperiments-NewcomerTasks, 10serviceops, and 2 others: linkrecommendation kubernetes service is down with HTTP 504: "upstream request timeout" - https://phabricator.wikimedia.org/T340780 (10Urbanecm_WMF) FYI @Marostegui, who merged 
https://gerrit.wikimedia.org/r/c/operations/dns/... [11:51:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet [11:51:49] 10SRE, 10Add-Link, 10GrowthExperiments-NewcomerTasks, 10serviceops, and 2 others: linkrecommendation kubernetes service is down with HTTP 504: "upstream request timeout" - https://phabricator.wikimedia.org/T340780 (10Urbanecm_WMF) 05Resolved→03Open This issue's happening again, for the same reasons ({d... [11:53:14] 10SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10Mabualruz) >>! In T342535#9097588, @thcipriani wrote: > @Mabualruz I can't remember have you done our https://wikitech.wikimedia.org/wiki/Deployments/T... [11:53:14] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetmaster1006.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jbond@cumin1001" [11:53:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T344589)', diff saved to https://phabricator.wikimedia.org/P51594 and previous config saved to /var/cache/conftool/dbconfig/20230828-115328-ladsgroup.json [11:53:47] PROBLEM - Host kubestagetcd2001 is DOWN: PING CRITICAL - Packet loss = 100% [11:53:53] 10SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10Mabualruz) >>! In T342535#9098329, @Clement_Goubert wrote: > @Mabualruz The out of band verification of your SSH public key is still required as well.... 
[11:53:59] (CR) Filippo Giunchedi: [C: +2] aptrepo: update apt.grafana.com key [puppet] - https://gerrit.wikimedia.org/r/952841 (owner: Filippo Giunchedi) [11:55:29] RECOVERY - Host kubestagetcd2001 is UP: PING OK - Packet loss = 0%, RTA = 32.08 ms [11:55:48] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetmaster1006.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jbond@cumin1001" [11:55:49] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:55:49] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts puppetmaster1006.eqiad.wmnet [11:55:55] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Puppet (Puppet 7.0): reimage puppetmasteres to puppetserveres - https://phabricator.wikimedia.org/T345067 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jbond@cumin1001 for hosts: `puppetmaster1006.eqiad.wmnet` - puppetmas... 
[11:57:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet [11:57:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet [12:00:55] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2021.codfw.wmnet [12:05:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T343718)', diff saved to https://phabricator.wikimedia.org/P51595 and previous config saved to /var/cache/conftool/dbconfig/20230828-120509-ladsgroup.json [12:05:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance [12:05:15] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [12:05:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance [12:05:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2146 (T343718)', diff saved to https://phabricator.wikimedia.org/P51596 and previous config saved to /var/cache/conftool/dbconfig/20230828-120530-ladsgroup.json [12:06:13] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:07:03] (03PS2) 10Urbanecm: linkrecommendation: Add dbproxy1023 to network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/952843 (https://phabricator.wikimedia.org/T340780) [12:07:38] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:08:35] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P51597 and previous config saved to /var/cache/conftool/dbconfig/20230828-120835-ladsgroup.json [12:09:01] (03PS1) 10KartikMistry: Enable Content and Section translation in Ligurian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952846 (https://phabricator.wikimedia.org/T337669) [12:09:25] (03CR) 10Marostegui: [C: 03+1] linkrecommendation: Add dbproxy1023 to network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/952843 (https://phabricator.wikimedia.org/T340780) (owner: 10Urbanecm) [12:09:34] urbanecm: do you want me to revert for now the failover? [12:09:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2021.codfw.wmnet [12:09:40] marostegui: nah, i'll deploy the fix. [12:09:40] I can right now if you tell me to [12:09:44] thanks for the +1 :) [12:09:55] (03CR) 10Urbanecm: [C: 03+2] linkrecommendation: Add dbproxy1023 to network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/952843 (https://phabricator.wikimedia.org/T340780) (owner: 10Urbanecm) [12:10:38] (03Merged) 10jenkins-bot: linkrecommendation: Add dbproxy1023 to network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/952843 (https://phabricator.wikimedia.org/T340780) (owner: 10Urbanecm) [12:11:12] !log urbanecm@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [12:11:44] !log urbanecm@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [12:12:38] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:12:39] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cloudsw1-b1-codfw [12:13:18] !log urbanecm@deploy1002 helmfile [eqiad] START 
helmfile.d/services/linkrecommendation: apply [12:13:44] !log disable puppet and stop pybal on lvs5006 for reboot (T344587) [12:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:03] !log urbanecm@deploy1002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [12:14:21] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs5006.eqsin.wmnet [12:14:45] !log urbanecm@deploy1002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [12:14:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T343718)', diff saved to https://phabricator.wikimedia.org/P51598 and previous config saved to /var/cache/conftool/dbconfig/20230828-121454-ladsgroup.json [12:14:59] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [12:15:01] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw1-b1-codfw [12:15:09] !log urbanecm@deploy1002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [12:16:12] marostegui: all good now, service's back online. thanks for the offer though :). [12:16:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2021.codfw.wmnet [12:16:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2021.codfw.wmnet [12:16:18] urbanecm: thanks! 
[12:16:53] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:17:03] (03CR) 10Muehlenhoff: [C: 03+2] docker build: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/952300 (owner: 10Muehlenhoff) [12:17:13] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs5006.eqsin.wmnet [12:17:59] !log enable puppet and start pybal on lvs5006 (T344587) [12:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:07] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:18:26] 10SRE, 10Add-Link, 10GrowthExperiments-NewcomerTasks, 10serviceops, and 2 others: linkrecommendation kubernetes service is down with HTTP 504: "upstream request timeout" - https://phabricator.wikimedia.org/T340780 (10Urbanecm_WMF) 05Open→03Resolved And service's up again. 
[12:19:46] (03CR) 10Muehlenhoff: [C: 03+2] haproxy: Simplify systemd wrapper [puppet] - 10https://gerrit.wikimedia.org/r/952161 (owner: 10Muehlenhoff) [12:20:23] !log disable puppet and stop pybal on lvs5005 for reboot (T344587) [12:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:29] (03PS2) 10Muehlenhoff: debian: Remove support for Stretch and update spec tests [puppet] - 10https://gerrit.wikimedia.org/r/952222 [12:21:49] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:23:25] PROBLEM - pybal on lvs5005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [12:23:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P51599 and previous config saved to /var/cache/conftool/dbconfig/20230828-122341-ladsgroup.json [12:23:57] PROBLEM - PyBal backends health check on lvs5005 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [12:24:12] (03CR) 10CI reject: [V: 04-1] debian: Remove support for Stretch and update spec tests [puppet] - 10https://gerrit.wikimedia.org/r/952222 (owner: 10Muehlenhoff) [12:25:05] PROBLEM - PyBal connections to etcd on lvs5005 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [12:26:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance [12:26:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance [12:29:21] !log jbond@cumin1001 START - Cookbook sre.dns.netbox [12:29:46] 
!log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1026.eqiad.wmnet with reason: Maintenance [12:29:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1026.eqiad.wmnet with reason: Maintenance [12:30:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P51600 and previous config saved to /var/cache/conftool/dbconfig/20230828-123000-ladsgroup.json [12:30:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1026 (T344589)', diff saved to https://phabricator.wikimedia.org/P51601 and previous config saved to /var/cache/conftool/dbconfig/20230828-123004-ladsgroup.json [12:33:37] !log jbond@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [12:33:41] !log jbond@cumin1001 START - Cookbook sre.dns.netbox [12:34:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [12:34:09] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cloudsw1-b1-codfw [12:34:15] PROBLEM - BGP status on lsw1-e2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:34:19] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.network.tls (exit_code=97) for network device cloudsw1-b1-codfw [12:34:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [12:34:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:34:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance [12:34:38] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:34:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T344589)', diff saved to https://phabricator.wikimedia.org/P51602 and previous config saved to /var/cache/conftool/dbconfig/20230828-123444-ladsgroup.json [12:34:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance [12:34:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T344589)', diff saved to https://phabricator.wikimedia.org/P51603 and previous config saved to /var/cache/conftool/dbconfig/20230828-123452-ladsgroup.json [12:37:05] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs5005.eqsin.wmnet [12:37:07] RECOVERY - BGP status on lsw1-e2-eqiad.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:38:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T344589)', diff saved to https://phabricator.wikimedia.org/P51604 and previous config saved to /var/cache/conftool/dbconfig/20230828-123847-ladsgroup.json [12:38:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance [12:39:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance [12:39:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:39:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:39:18] 
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1221 (T344589)', diff saved to https://phabricator.wikimedia.org/P51605 and previous config saved to /var/cache/conftool/dbconfig/20230828-123917-ladsgroup.json [12:39:41] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add puppetserver1002 - jbond@cumin1001" [12:39:57] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs5005.eqsin.wmnet [12:40:26] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add puppetserver1002 - jbond@cumin1001" [12:40:26] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:40:27] RECOVERY - pybal on lvs5005 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [12:40:48] !log enable puppet and start pybal on lvs5005 (T344587) [12:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:01] RECOVERY - PyBal backends health check on lvs5005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:41:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T344589)', diff saved to https://phabricator.wikimedia.org/P51606 and previous config saved to /var/cache/conftool/dbconfig/20230828-124104-ladsgroup.json [12:41:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T344589)', diff saved to https://phabricator.wikimedia.org/P51607 and previous config saved to /var/cache/conftool/dbconfig/20230828-124113-ladsgroup.json [12:41:15] RECOVERY - PyBal connections to etcd on lvs5005 is OK: OK: 4 connections established with conf2006.codfw.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [12:41:45] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db2146 (T343718)', diff saved to https://phabricator.wikimedia.org/P51608 and previous config saved to /var/cache/conftool/dbconfig/20230828-124145-ladsgroup.json [12:41:50] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [12:42:46] (03PS1) 10Muehlenhoff: Openstack: remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/952848 [12:43:30] 10ops-codfw, 10Content-Transform-Team, 10serviceops-radar, 10Maps (Maps-data): maps2009 is unreachable - https://phabricator.wikimedia.org/T344110 (10jijiki) >>! In T344110#9117302, @Jhancock.wm wrote: > Turns out there is one more thing I need to do to. I missed a firmware update. Is it safe for me to reb... [12:44:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1026 (T344589)', diff saved to https://phabricator.wikimedia.org/P51609 and previous config saved to /var/cache/conftool/dbconfig/20230828-124457-ladsgroup.json [12:45:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P51610 and previous config saved to /var/cache/conftool/dbconfig/20230828-124506-ladsgroup.json [12:45:13] !log jbond@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host puppetserver1002 [12:46:30] !log jbond@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host puppetserver1002 [12:47:28] (03PS2) 10Stevemunene: datahub: Add the oidc scope [deployment-charts] - 10https://gerrit.wikimedia.org/r/952842 (https://phabricator.wikimedia.org/T305874) [12:47:48] (03CR) 10Fabfur: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/952555 (owner: 10BBlack) [12:49:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet [12:51:16] (03CR) 10BBlack: [C: 03+2] esams: set frontend memory reservation to 128 [puppet] - 
10https://gerrit.wikimedia.org/r/952555 (owner: 10BBlack) [12:51:18] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952848 (owner: 10Muehlenhoff) [12:52:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T344589)', diff saved to https://phabricator.wikimedia.org/P51611 and previous config saved to /var/cache/conftool/dbconfig/20230828-125237-ladsgroup.json [12:54:17] (03PS1) 10Ayounsi: sre.network.tls: use different ports on junos/sonic [cookbooks] - 10https://gerrit.wikimedia.org/r/952851 (https://phabricator.wikimedia.org/T334594) [12:55:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet [12:56:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P51612 and previous config saved to /var/cache/conftool/dbconfig/20230828-125610-ladsgroup.json [12:56:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P51613 and previous config saved to /var/cache/conftool/dbconfig/20230828-125619-ladsgroup.json [12:56:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P51614 and previous config saved to /var/cache/conftool/dbconfig/20230828-125651-ladsgroup.json [12:58:21] PROBLEM - BGP status on lsw1-e3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:58:59] (03PS1) 10Jbond: sre.network.tls: Also add the intermediate certificate to the cert file [cookbooks] - 10https://gerrit.wikimedia.org/r/952852 [12:59:49] RECOVERY - BGP status on lsw1-e3-eqiad.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:00:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1026', diff saved to https://phabricator.wikimedia.org/P51615 and previous config saved to /var/cache/conftool/dbconfig/20230828-130004-ladsgroup.json [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230828T1300) [13:00:04] No Gerrit patches in the queue for this window AFAICS. [13:00:12] (03PS1) 10Muehlenhoff: Remove haveged [puppet] - 10https://gerrit.wikimedia.org/r/952855 [13:00:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T343718)', diff saved to https://phabricator.wikimedia.org/P51616 and previous config saved to /var/cache/conftool/dbconfig/20230828-130012-ladsgroup.json [13:00:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [13:00:20] yeah, nothing to deploy it looks like [13:00:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [13:00:37] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [13:01:34] !log disable puppet and stop pybal on lvs5004 for reboot (T344587) [13:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:38] (03CR) 10CI reject: [V: 04-1] sre.network.tls: Also add the intermediate certificate to the cert file [cookbooks] - 10https://gerrit.wikimedia.org/r/952852 (owner: 10Jbond) [13:01:47] !log esams cp clusters: rolling restarts of varnish-frontend ~1h apart over the next ~8h, to apply memory sizing change from: https://gerrit.wikimedia.org/r/c/operations/puppet/+/952555/ [13:01:50] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet [13:02:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet [13:03:03] (03CR) 10Ayounsi: [C: 03+2] sre.network.tls: use different ports on junos/sonic [cookbooks] - 10https://gerrit.wikimedia.org/r/952851 (https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [13:03:45] PROBLEM - PyBal backends health check on lvs5004 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [13:03:47] * urbanecm steals the window then [13:03:55] (03CR) 10Urbanecm: [C: 03+2] Revert "ltwiki: Disable Growth features" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952808 (https://phabricator.wikimedia.org/T344013) (owner: 10Urbanecm) [13:04:22] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952808 (https://phabricator.wikimedia.org/T344013) (owner: 10Urbanecm) [13:04:27] PROBLEM - pybal on lvs5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [13:04:38] (03Merged) 10jenkins-bot: Revert "ltwiki: Disable Growth features" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952808 (https://phabricator.wikimedia.org/T344013) (owner: 10Urbanecm) [13:04:58] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:952808|Revert "ltwiki: Disable Growth features" (T344013)]] [13:05:09] T344013: Growth's config files were deleted on lt.wikipedia - https://phabricator.wikimedia.org/T344013 [13:05:36] (03Merged) 10jenkins-bot: sre.network.tls: use different ports on junos/sonic [cookbooks] - 10https://gerrit.wikimedia.org/r/952851 
(https://phabricator.wikimedia.org/T334594) (owner: 10Ayounsi) [13:06:24] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:952808|Revert "ltwiki: Disable Growth features" (T344013)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:06:26] (03CR) 10Gehel: [C: 03+1] "LGTM. I'm by no mean an expert here, but this seems to make sense and have low enough risk that we could merge and check." [deployment-charts] - 10https://gerrit.wikimedia.org/r/952842 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [13:07:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P51617 and previous config saved to /var/cache/conftool/dbconfig/20230828-130744-ladsgroup.json [13:07:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet [13:08:04] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2023.codfw.wmnet [13:08:07] PROBLEM - PyBal connections to etcd on lvs5004 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [13:08:19] !log urbanecm@deploy1002 urbanecm: Continuing with sync [13:11:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P51618 and previous config saved to /var/cache/conftool/dbconfig/20230828-131117-ladsgroup.json [13:11:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet [13:11:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P51619 and previous config saved to 
/var/cache/conftool/dbconfig/20230828-131125-ladsgroup.json [13:11:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P51620 and previous config saved to /var/cache/conftool/dbconfig/20230828-131157-ladsgroup.json [13:12:34] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:12:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet [13:14:03] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:952808|Revert "ltwiki: Disable Growth features" (T344013)]] (duration: 09m 04s) [13:14:07] T344013: Growth's config files were deleted on lt.wikipedia - https://phabricator.wikimedia.org/T344013 [13:14:15] PROBLEM - Check systemd state on ml-serve1006 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:14:51] PROBLEM - Host kubetcd2004 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1026', diff saved to https://phabricator.wikimedia.org/P51621 and previous config saved to /var/cache/conftool/dbconfig/20230828-131510-ladsgroup.json [13:15:51] PROBLEM - MariaDB Replica IO: s2 on clouddb1018 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1155.eqiad.wmnet:3312 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:15:52] (03PS1) 10Marostegui: control-mariadb-10.4-bullseye: Bump version [software] - 
10https://gerrit.wikimedia.org/r/952857 (https://phabricator.wikimedia.org/T344309) [13:16:05] PROBLEM - MariaDB Replica IO: s2 on clouddb1014 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1155.eqiad.wmnet:3312 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1155.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:16:13] ^ those are expected [13:17:34] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:17:38] (03CR) 10Marostegui: [C: 03+2] control-mariadb-10.4-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/952857 (https://phabricator.wikimedia.org/T344309) (owner: 10Marostegui) [13:18:09] (03Merged) 10jenkins-bot: control-mariadb-10.4-bullseye: Bump version [software] - 10https://gerrit.wikimedia.org/r/952857 (https://phabricator.wikimedia.org/T344309) (owner: 10Marostegui) [13:18:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet [13:18:45] (03PS2) 10Jbond: kernel_report: small script to generate reboots task [puppet] - 10https://gerrit.wikimedia.org/r/952401 [13:18:45] RECOVERY - MariaDB Replica IO: s2 on clouddb1018 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:18:57] RECOVERY - MariaDB Replica IO: s2 on clouddb1014 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:19:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet [13:19:12] (03PS4) 10David Caro: p:grafana: pass the envoyproxy::enable to autorestart 
[puppet] - 10https://gerrit.wikimedia.org/r/952813 (https://phabricator.wikimedia.org/T345060) [13:19:28] (03CR) 10David Caro: p:grafana: pass the envoyproxy::enable to autorestart (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952813 (https://phabricator.wikimedia.org/T345060) (owner: 10David Caro) [13:20:18] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs5004.eqsin.wmnet [13:20:23] RECOVERY - Host kubetcd2004 is UP: PING WARNING - Packet loss = 60%, RTA = 39.87 ms [13:21:14] (03CR) 10CI reject: [V: 04-1] kernel_report: small script to generate reboots task [puppet] - 10https://gerrit.wikimedia.org/r/952401 (owner: 10Jbond) [13:22:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet [13:22:34] (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (PATCH events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:22:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P51622 and previous config saved to /var/cache/conftool/dbconfig/20230828-132250-ladsgroup.json [13:23:09] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs5004.eqsin.wmnet [13:23:15] PROBLEM - pybal on lvs5004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [13:23:57] PROBLEM - PyBal backends health check on lvs5004 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [13:24:29] !log enable puppet and start pybal on lvs5004 (T344587) [13:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:41] RECOVERY - pybal on lvs5004 is OK: PROCS OK: 1 process with UID = 0 (root), args 
/usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [13:25:19] RECOVERY - PyBal connections to etcd on lvs5004 is OK: OK: 12 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [13:25:19] RECOVERY - PyBal backends health check on lvs5004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:26:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T344589)', diff saved to https://phabricator.wikimedia.org/P51623 and previous config saved to /var/cache/conftool/dbconfig/20230828-132623-ladsgroup.json [13:26:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [13:26:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T344589)', diff saved to https://phabricator.wikimedia.org/P51624 and previous config saved to /var/cache/conftool/dbconfig/20230828-132632-ladsgroup.json [13:26:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [13:26:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [13:26:46] 10SRE, 10MW-on-K8s, 10Observability-Logging, 10serviceops: Apache logs get split across packets in MW-on-K8s - https://phabricator.wikimedia.org/T344991 (10kamila) 05Open→03Resolved The message size limit is increased to 16k. Longer messages are very rare (< 1 per hour), so I think this is acceptable.... 
[13:26:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T344589)', diff saved to https://phabricator.wikimedia.org/P51625 and previous config saved to /var/cache/conftool/dbconfig/20230828-132648-ladsgroup.json [13:26:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [13:26:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T344589)', diff saved to https://phabricator.wikimedia.org/P51626 and previous config saved to /var/cache/conftool/dbconfig/20230828-132655-ladsgroup.json [13:27:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T343718)', diff saved to https://phabricator.wikimedia.org/P51627 and previous config saved to /var/cache/conftool/dbconfig/20230828-132703-ladsgroup.json [13:27:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [13:27:08] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [13:27:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [13:27:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T343718)', diff saved to https://phabricator.wikimedia.org/P51628 and previous config saved to /var/cache/conftool/dbconfig/20230828-132724-ladsgroup.json [13:30:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1026 (T344589)', diff saved to https://phabricator.wikimedia.org/P51629 and previous config saved to /var/cache/conftool/dbconfig/20230828-133016-ladsgroup.json [13:30:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1030.eqiad.wmnet with reason: Maintenance [13:30:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 
on es1030.eqiad.wmnet with reason: Maintenance [13:30:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1030 (T344589)', diff saved to https://phabricator.wikimedia.org/P51630 and previous config saved to /var/cache/conftool/dbconfig/20230828-133040-ladsgroup.json [13:35:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T344589)', diff saved to https://phabricator.wikimedia.org/P51631 and previous config saved to /var/cache/conftool/dbconfig/20230828-133514-ladsgroup.json [13:36:28] 10SRE, 10Observability-Metrics, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027 (10fgiunchedi) This is complete from my POV: the production fleet runs cadvisor and we're coll... [13:36:37] 10ops-codfw, 10Content-Transform-Team, 10serviceops-radar, 10Maps (Maps-data): maps2009 is unreachable - https://phabricator.wikimedia.org/T344110 (10Jhancock.wm) @jijiki all good. I knew it needed time to do its thing. I'll be on site for the next 3 hours today and roughly the same time for the rest of th... 
[13:36:41] 10SRE, 10Observability-Metrics, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027 (10fgiunchedi) [13:37:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [13:37:49] (03CR) 10FNegri: [C: 03+1] p:grafana: pass the envoyproxy::enable to autorestart [puppet] - 10https://gerrit.wikimedia.org/r/952813 (https://phabricator.wikimedia.org/T345060) (owner: 10David Caro) [13:37:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T344589)', diff saved to https://phabricator.wikimedia.org/P51632 and previous config saved to /var/cache/conftool/dbconfig/20230828-133756-ladsgroup.json [13:38:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [13:38:07] 10SRE, 10Observability-Metrics, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi [13:38:40] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater-dse-k8s: Add Zookeeper HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/951551 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [13:39:30] (03CR) 10Samtar: [C: 03+1] Customise $wgSitename on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952563 (https://phabricator.wikimedia.org/T181908) (owner: 10Tim Starling) [13:39:42] (03Merged) 10jenkins-bot: rdf-streaming-updater-dse-k8s: Add Zookeeper HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/951551 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [13:41:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1141.eqiad.wmnet 
with reason: Maintenance [13:41:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance [13:41:33] PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve1006 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:41:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T343718)', diff saved to https://phabricator.wikimedia.org/P51633 and previous config saved to /var/cache/conftool/dbconfig/20230828-134137-ladsgroup.json [13:41:43] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [13:41:47] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:42:51] RECOVERY - Check systemd state on ml-serve1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:43:11] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:43:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet [13:45:27] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device cloudsw1-b1-codfw [13:45:27] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw1-b1-codfw [13:45:34] (03CR) 10Jbond: [C: 03+1] "See inline for ideas on getting the bundle created" [puppet] - 10https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [13:46:43] (RedisMemoryFull) firing: (3) Redis memory full on rdb1012:16378 - 
https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [13:46:47] 10SRE, 10ops-eqiad: Broken disk on thanos-be1003 - https://phabricator.wikimedia.org/T345079 (10MoritzMuehlenhoff) [13:46:55] 10SRE, 10ops-eqiad: Broken disk on thanos-be1003 - https://phabricator.wikimedia.org/T345079 (10MoritzMuehlenhoff) p:05Triage→03Medium [13:48:47] (03PS1) 10Jbond: puppetserver: puppetmaster1006 has been renamed to puppetserver1002 [puppet] - 10https://gerrit.wikimedia.org/r/952861 (https://phabricator.wikimedia.org/T345067) [13:49:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet [13:49:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1030 (T344589)', diff saved to https://phabricator.wikimedia.org/P51634 and previous config saved to /var/cache/conftool/dbconfig/20230828-134934-ladsgroup.json [13:49:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet [13:49:44] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Relabel: puppetmaster1006 to puppetserver1002 - https://phabricator.wikimedia.org/T345080 (10jbond) [13:49:58] (03CR) 10Jbond: [C: 03+2] puppetserver: puppetmaster1006 has been renamed to puppetserver1002 [puppet] - 10https://gerrit.wikimedia.org/r/952861 (https://phabricator.wikimedia.org/T345067) (owner: 10Jbond) [13:50:15] (03PS1) 10Muehlenhoff: Add some ferm->nft migration steps to the firewall class [puppet] - 10https://gerrit.wikimedia.org/r/952862 (https://phabricator.wikimedia.org/T336497) [13:50:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P51635 and previous config saved to /var/cache/conftool/dbconfig/20230828-135021-ladsgroup.json [13:50:22] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P51636 and previous config saved to /var/cache/conftool/dbconfig/20230828-135021-ladsgroup.json [13:50:39] (03CR) 10CI reject: [V: 04-1] Add some ferm->nft migration steps to the firewall class [puppet] - 10https://gerrit.wikimedia.org/r/952862 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:50:48] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host puppetserver1002.eqiad.wmnet with OS bookworm [13:51:28] !log bounce ferm on ml-serve1006 [13:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:18] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:53:18] (03PS1) 10Bking: wdqs1005: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/952864 (https://phabricator.wikimedia.org/T344198) [13:53:38] (03PS1) 10Marostegui: mariadb: Move db1118 to m2 [puppet] - 10https://gerrit.wikimedia.org/r/952865 (https://phabricator.wikimedia.org/T138915) [13:53:45] PROBLEM - BGP status on lsw1-f2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:55:13] RECOVERY - BGP status on lsw1-f2-eqiad.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:55:38] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1118 to m2 [puppet] - 10https://gerrit.wikimedia.org/r/952865 (https://phabricator.wikimedia.org/T138915) (owner: 10Marostegui) [13:58:13] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/952855 (owner: 10Muehlenhoff) 
[13:58:18] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:58:23] PROBLEM - haproxy failover on dbproxy1023 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [13:58:31] PROBLEM - haproxy failover on dbproxy1025 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:00:03] (03CR) 10Stevemunene: [C: 03+2] datahub: Add the oidc scope [deployment-charts] - 10https://gerrit.wikimedia.org/r/952842 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [14:00:51] PROBLEM - haproxy failover on dbproxy1022 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:00:54] (03Merged) 10jenkins-bot: datahub: Add the oidc scope [deployment-charts] - 10https://gerrit.wikimedia.org/r/952842 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [14:01:02] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: Improve MediaWiki logging of errors uploading files to Swift - https://phabricator.wikimedia.org/T231107 (10Krinkle) [14:01:31] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:01:36] haproxy alerts are expected [14:01:41] I was about to ask :D [14:01:59] :) [14:02:05] (03PS8) 10Ayounsi: Add gNMI based telemetry collection using gNMIc [puppet] - 10https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) [14:02:20] !log stevemunene@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [14:02:45] !log stevemunene@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [14:03:41] 
(03PS1) 10BBlack: esams: set frontend memory reservation to 170 [puppet] - 10https://gerrit.wikimedia.org/r/952866 [14:03:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T343718)', diff saved to https://phabricator.wikimedia.org/P51637 and previous config saved to /var/cache/conftool/dbconfig/20230828-140345-ladsgroup.json [14:03:50] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [14:04:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1030', diff saved to https://phabricator.wikimedia.org/P51638 and previous config saved to /var/cache/conftool/dbconfig/20230828-140440-ladsgroup.json [14:05:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P51639 and previous config saved to /var/cache/conftool/dbconfig/20230828-140527-ladsgroup.json [14:05:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P51640 and previous config saved to /var/cache/conftool/dbconfig/20230828-140528-ladsgroup.json [14:07:20] (03CR) 10Jforrester: "This is set to `https://api-rw.discovery.wmnet/w/api.php` in the chart; if that's wrong, should we not just adjust it there?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/952782 (https://phabricator.wikimedia.org/T344998) (owner: 10JMeybohm) [14:07:20] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:07:24] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:07:41] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:07:56] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [14:08:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10Jclark-ctr) lists1004 C6. U19. Port 16 Cableid 3222 [14:08:54] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:09:27] (03CR) 10Filippo Giunchedi: [C: 03+1] p:grafana: pass the envoyproxy::enable to autorestart [puppet] - 10https://gerrit.wikimedia.org/r/952813 (https://phabricator.wikimedia.org/T345060) (owner: 10David Caro) [14:09:29] PROBLEM - haproxy failover on dbproxy1026 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:09:37] PROBLEM - Check systemd state on ml-serve1007 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374 (10Jclark-ctr) [14:11:33] !log disable puppet and stop pybal on lvs6003 for reboot (T344587) 
[14:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:39] RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve1006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:12:08] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs6003.drmrs.wmnet [14:12:29] !log stevemunene@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [14:12:33] 10SRE-swift-storage, 10MediaWiki-extensions-Nuke: Special:Nuke on Commons creates "Error: ERR_READ_TIMEOUT, errno [No Error]" while mass-deleting about >20 files - https://phabricator.wikimedia.org/T40028 (10Samwalton9) Since the last update was nearly 10 years ago - is this still happening? [14:12:47] RECOVERY - haproxy failover on dbproxy1026 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:13:11] PROBLEM - haproxy failover on dbproxy1027 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:13:49] PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve1007 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:14:45] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs6003.drmrs.wmnet [14:14:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance [14:14:51] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:14:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance [14:15:05] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Depooling db1163 (T343718)', diff saved to https://phabricator.wikimedia.org/P51641 and previous config saved to /var/cache/conftool/dbconfig/20230828-141505-ladsgroup.json [14:15:11] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [14:15:16] (03PS1) 10Clément Goubert: mediawiki: Remove tls-proxy CPU limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/952867 (https://phabricator.wikimedia.org/T344814) [14:15:45] (03CR) 10Vgutierrez: [C: 03+1] esams: set frontend memory reservation to 170 (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952866 (owner: 10BBlack) [14:16:05] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:16:14] !log stevemunene@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [14:16:19] PROBLEM - haproxy failover on dbproxy1026 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:16:33] RECOVERY - Check systemd state on restbase1027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:50] !log enable puppet and start pybal on lvs6003 (T344587) [14:16:53] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:59] RECOVERY - cassandra-a service on restbase1027 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:18:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P51642 and previous config saved to /var/cache/conftool/dbconfig/20230828-141851-ladsgroup.json [14:18:54] (JobUnavailable) firing: 
(2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:19:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1030', diff saved to https://phabricator.wikimedia.org/P51643 and previous config saved to /var/cache/conftool/dbconfig/20230828-141946-ladsgroup.json [14:19:53] RECOVERY - cassandra-a SSL 10.64.48.184:7000 on restbase1027 is OK: SSL OK - Certificate restbase1027-a valid until 2025-02-21 18:43:51 +0000 (expires in 543 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:19:53] RECOVERY - cassandra-a CQL 10.64.48.184:9042 on restbase1027 is OK: TCP OK - 0.001 second response time on 10.64.48.184 port 9042 https://phabricator.wikimedia.org/T93886 [14:20:04] !log disable puppet and stop pybal on lvs6002 for reboot (T344587) [14:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T344589)', diff saved to https://phabricator.wikimedia.org/P51644 and previous config saved to /var/cache/conftool/dbconfig/20230828-142033-ladsgroup.json [14:20:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T344589)', diff saved to https://phabricator.wikimedia.org/P51645 and previous config saved to /var/cache/conftool/dbconfig/20230828-142034-ladsgroup.json [14:20:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [14:20:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance [14:20:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 
db1173.eqiad.wmnet with reason: Maintenance [14:20:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance [14:20:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1173 (T344589)', diff saved to https://phabricator.wikimedia.org/P51646 and previous config saved to /var/cache/conftool/dbconfig/20230828-142056-ladsgroup.json [14:21:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2129 (T344589)', diff saved to https://phabricator.wikimedia.org/P51647 and previous config saved to /var/cache/conftool/dbconfig/20230828-142105-ladsgroup.json [14:22:47] PROBLEM - PyBal backends health check on lvs6002 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [14:24:32] (03PS1) 10Kamila Součková: benthos/mw_accesslog_metrics: Add deployment label [puppet] - 10https://gerrit.wikimedia.org/r/952868 (https://phabricator.wikimedia.org/T276095) [14:24:49] (03CR) 10BBlack: [C: 03+2] esams: set frontend memory reservation to 170 [puppet] - 10https://gerrit.wikimedia.org/r/952866 (owner: 10BBlack) [14:25:25] RECOVERY - Check systemd state on ml-serve1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:30] !log bounced ferm.service on ml-serve1007 [14:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:35] PROBLEM - PyBal connections to etcd on lvs6002 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [14:26:26] claime: thanks! 
[14:26:55] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:26:59] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:27:03] RECOVERY - haproxy failover on dbproxy1027 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:27:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T344589)', diff saved to https://phabricator.wikimedia.org/P51648 and previous config saved to /var/cache/conftool/dbconfig/20230828-142718-ladsgroup.json [14:27:36] (03CR) 10Clément Goubert: [C: 03+1] benthos/mw_accesslog_metrics: Add deployment label [puppet] - 10https://gerrit.wikimedia.org/r/952868 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková) [14:28:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host moss-be2003.mgmt.codfw.wmnet with reboot policy FORCED [14:28:19] (03CR) 10Kamila Součková: [C: 03+2] benthos/mw_accesslog_metrics: Add deployment label [puppet] - 10https://gerrit.wikimedia.org/r/952868 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková) [14:28:51] elukey: yw ;) [14:29:23] PROBLEM - pybal on lvs6002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [14:29:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T344589)', diff saved to https://phabricator.wikimedia.org/P51649 and previous config saved to /var/cache/conftool/dbconfig/20230828-142927-ladsgroup.json [14:29:41] PROBLEM - haproxy failover on dbproxy1024 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:29:44] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host moss-be2003.mgmt.codfw.wmnet with reboot 
policy FORCED [14:30:31] (03Abandoned) 10Jbond: sre.network.tls: Also add the intermediate certificate to the cert file [cookbooks] - 10https://gerrit.wikimedia.org/r/952852 (owner: 10Jbond) [14:31:03] (03CR) 10Jbond: [C: 03+1] "\o/ lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/952855 (owner: 10Muehlenhoff) [14:31:11] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:31:13] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase1027.eqiad.wmnet [14:31:14] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase1027.eqiad.wmnet [14:31:15] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:31:21] PROBLEM - haproxy failover on dbproxy1027 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:32:39] !log esams cp clusters: rolling restarts of varnish-frontend ~1h apart over the next ~8h, to apply memory sizing change from: https://gerrit.wikimedia.org/r/c/operations/puppet/+/952866/ (earlier run only did 1 host per cluster before we changed direction!) 
[14:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:47] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:33:44] (03PS9) 10Ayounsi: Add gNMI based telemetry collection using gNMIc [puppet] - 10https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) [14:33:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P51650 and previous config saved to /var/cache/conftool/dbconfig/20230828-143357-ladsgroup.json [14:34:29] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:34:31] RECOVERY - haproxy failover on dbproxy1022 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:34:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1030 (T344589)', diff saved to https://phabricator.wikimedia.org/P51651 and previous config saved to /var/cache/conftool/dbconfig/20230828-143453-ladsgroup.json [14:34:57] (03PS2) 10Muehlenhoff: Add some ferm->nft migration steps to the firewall class [puppet] - 10https://gerrit.wikimedia.org/r/952862 (https://phabricator.wikimedia.org/T336497) [14:36:50] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host puppetserver1002.eqiad.wmnet with OS bookworm [14:37:25] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs6002.drmrs.wmnet [14:37:33] PROBLEM - BGP status on lsw1-f3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:37:40] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952862 
(https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:37:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1027.eqiad.wmnet with reason: Maintenance [14:38:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1027.eqiad.wmnet with reason: Maintenance [14:38:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1027 (T344589)', diff saved to https://phabricator.wikimedia.org/P51652 and previous config saved to /var/cache/conftool/dbconfig/20230828-143808-ladsgroup.json [14:38:21] (03CR) 10Jbond: Add gNMI based telemetry collection using gNMIc (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [14:38:47] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:38:49] PROBLEM - haproxy failover on dbproxy1022 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:38:59] RECOVERY - BGP status on lsw1-f3-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:39:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet [14:39:34] (03PS10) 10Ayounsi: Add gNMI based telemetry collection using gNMIc [puppet] - 10https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) [14:39:47] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [14:40:02] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs6002.drmrs.wmnet [14:40:05] (03CR) 10Ayounsi: Add gNMI based telemetry collection using gNMIc (2 comments) [puppet] - 10https://gerrit.wikimedia.org/r/952325 
(https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [14:40:15] RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:40:26] !log enable puppet and start pybal on lvs6002 (T344587) [14:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:44] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [14:40:57] RECOVERY - haproxy failover on dbproxy1026 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:41:11] RECOVERY - haproxy failover on dbproxy1024 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:41:11] RECOVERY - PyBal connections to etcd on lvs6002 is OK: OK: 4 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [14:41:11] RECOVERY - pybal on lvs6002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [14:41:11] RECOVERY - PyBal backends health check on lvs6002 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:41:17] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:41:21] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:41:25] RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:41:27] RECOVERY - haproxy failover on dbproxy1027 is OK: OK 
check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:41:39] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:41:41] RECOVERY - haproxy failover on dbproxy1022 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:42:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P51653 and previous config saved to /var/cache/conftool/dbconfig/20230828-144224-ladsgroup.json [14:43:21] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1027 (T344589)', diff saved to https://phabricator.wikimedia.org/P51654 and previous config saved to /var/cache/conftool/dbconfig/20230828-144406-ladsgroup.json [14:44:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P51655 and previous config saved to /var/cache/conftool/dbconfig/20230828-144433-ladsgroup.json [14:44:49] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:41] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:45:47] PROBLEM - haproxy failover on dbproxy1027 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:45:59] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:46:01] PROBLEM - haproxy failover on 
dbproxy1022 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:46:43] PROBLEM - haproxy failover on dbproxy1026 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:46:54] (03PS1) 10Kamila Součková: benthos/mw_accesslog_metrics: fix deployment label [puppet] - 10https://gerrit.wikimedia.org/r/952869 (https://phabricator.wikimedia.org/T276095) [14:46:57] PROBLEM - haproxy failover on dbproxy1024 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:47:01] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:47:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet [14:48:42] (03CR) 10Clément Goubert: [C: 03+1] benthos/mw_accesslog_metrics: fix deployment label [puppet] - 10https://gerrit.wikimedia.org/r/952869 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková) [14:49:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T343718)', diff saved to https://phabricator.wikimedia.org/P51656 and previous config saved to /var/cache/conftool/dbconfig/20230828-144903-ladsgroup.json [14:49:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [14:49:09] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [14:49:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [14:49:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T343718)', diff saved to https://phabricator.wikimedia.org/P51657 and previous config saved to /var/cache/conftool/dbconfig/20230828-144924-ladsgroup.json [14:49:35] RECOVERY - haproxy 
failover on dbproxy1026 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:50:44] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [14:51:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T343718)', diff saved to https://phabricator.wikimedia.org/P51658 and previous config saved to /var/cache/conftool/dbconfig/20230828-145116-ladsgroup.json [14:52:02] (03CR) 10Kamila Součková: [C: 03+2] benthos/mw_accesslog_metrics: fix deployment label [puppet] - 10https://gerrit.wikimedia.org/r/952869 (https://phabricator.wikimedia.org/T276095) (owner: 10Kamila Součková) [14:52:39] RECOVERY - haproxy failover on dbproxy1024 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:52:45] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:53:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet [14:53:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet [14:54:15] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:54:19] PROBLEM - Check systemd state on ml-serve1008 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:21] RECOVERY - haproxy failover on dbproxy1027 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [14:54:51] !log bounced ferm.service on ml-serve1008 [14:54:55] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:27] claime: it is the last one :D [14:55:30] !log elukey@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-serve-worker-eqiad [14:55:31] :D [14:55:41] (03CR) 10Herron: LiftWing: add latency/availability SLO dashboards (036 comments) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/952226 (owner: 10Klausman) [14:55:45] RECOVERY - Check systemd state on ml-serve1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:56:40] (03PS3) 10Jbond: kernel_report: small script to generate reboots task [puppet] - 10https://gerrit.wikimedia.org/r/952401 [14:56:53] (03CR) 10Muehlenhoff: [C: 03+2] Remove haveged [puppet] - 10https://gerrit.wikimedia.org/r/952855 (owner: 10Muehlenhoff) [14:57:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P51659 and previous config saved to /var/cache/conftool/dbconfig/20230828-145730-ladsgroup.json [14:58:31] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:58:35] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:58:41] PROBLEM - haproxy failover on dbproxy1027 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [14:59:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance [14:59:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance [14:59:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1027', diff saved to 
https://phabricator.wikimedia.org/P51660 and previous config saved to /var/cache/conftool/dbconfig/20230828-145912-ladsgroup.json [14:59:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2106 (T343718)', diff saved to https://phabricator.wikimedia.org/P51661 and previous config saved to /var/cache/conftool/dbconfig/20230828-145921-ladsgroup.json [14:59:26] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [14:59:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P51662 and previous config saved to /var/cache/conftool/dbconfig/20230828-145940-ladsgroup.json [14:59:59] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [15:00:03] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [15:00:09] RECOVERY - haproxy failover on dbproxy1027 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [15:00:46] (03PS1) 10Ilias Sarantopoulos: ml-services: deploy fix for ores-legacy item-models [deployment-charts] - 10https://gerrit.wikimedia.org/r/952874 (https://phabricator.wikimedia.org/T345063) [15:01:07] PROBLEM - haproxy failover on dbproxy1026 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [15:03:05] (03CR) 10Elukey: [C: 03+1] ml-services: deploy fix for ores-legacy item-models [deployment-charts] - 10https://gerrit.wikimedia.org/r/952874 (https://phabricator.wikimedia.org/T345063) (owner: 10Ilias Sarantopoulos) [15:03:30] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: deploy fix for ores-legacy item-models [deployment-charts] - 10https://gerrit.wikimedia.org/r/952874 (https://phabricator.wikimedia.org/T345063) (owner: 10Ilias Sarantopoulos) [15:03:59] RECOVERY - haproxy failover on 
dbproxy1026 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [15:04:13] PROBLEM - haproxy failover on dbproxy1024 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [15:04:19] (03Merged) 10jenkins-bot: ml-services: deploy fix for ores-legacy item-models [deployment-charts] - 10https://gerrit.wikimedia.org/r/952874 (https://phabricator.wikimedia.org/T345063) (owner: 10Ilias Sarantopoulos) [15:04:41] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [15:04:43] RECOVERY - haproxy failover on dbproxy1022 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [15:05:35] !log isaranto@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [15:06:12] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . [15:06:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P51663 and previous config saved to /var/cache/conftool/dbconfig/20230828-150622-ladsgroup.json [15:07:00] (03PS2) 10Slyngshede: Email on successful signup. [software/bitu] - 10https://gerrit.wikimedia.org/r/952658 [15:07:06] !log isaranto@deploy1002 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[15:07:09] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [15:07:13] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [15:07:30] (03CR) 10David Caro: [C: 03+2] p:grafana: pass the envoyproxy::enable to autorestart [puppet] - 10https://gerrit.wikimedia.org/r/952813 (https://phabricator.wikimedia.org/T345060) (owner: 10David Caro) [15:08:43] PROBLEM - haproxy failover on dbproxy1027 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [15:08:55] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [15:08:57] PROBLEM - haproxy failover on dbproxy1022 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [15:09:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host moss-be2003.mgmt.codfw.wmnet with reboot policy FORCED [15:11:03] PROBLEM - haproxy failover on dbproxy1026 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [15:12:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T344589)', diff saved to https://phabricator.wikimedia.org/P51664 and previous config saved to /var/cache/conftool/dbconfig/20230828-151236-ladsgroup.json [15:12:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [15:12:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [15:13:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T344589)', diff saved to https://phabricator.wikimedia.org/P51665 and previous config saved to 
/var/cache/conftool/dbconfig/20230828-151300-ladsgroup.json [15:14:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1027', diff saved to https://phabricator.wikimedia.org/P51666 and previous config saved to /var/cache/conftool/dbconfig/20230828-151418-ladsgroup.json [15:14:39] RECOVERY - haproxy failover on dbproxy1022 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [15:14:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T344589)', diff saved to https://phabricator.wikimedia.org/P51667 and previous config saved to /var/cache/conftool/dbconfig/20230828-151446-ladsgroup.json [15:14:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance [15:14:53] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host moss-be2003.mgmt.codfw.wmnet with reboot policy FORCED [15:14:54] (03PS1) 10Jbond: interfaces: updated to use f-strings [puppet] - 10https://gerrit.wikimedia.org/r/952876 [15:14:56] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host moss-be2003.mgmt.codfw.wmnet with reboot policy FORCED [15:15:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance [15:15:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2151 (T344589)', diff saved to https://phabricator.wikimedia.org/P51668 and previous config saved to /var/cache/conftool/dbconfig/20230828-151511-ladsgroup.json [15:15:52] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host moss-be2003.mgmt.codfw.wmnet with reboot policy FORCED [15:15:55] (03PS1) 10Muehlenhoff: Remove comment references to EOLed distros [puppet] - 10https://gerrit.wikimedia.org/r/952877 [15:16:07] !log disable puppet and stop pybal on lvs6001 for reboot (T344587) 
[15:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:19] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:47] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host moss-be2003.mgmt.codfw.wmnet with reboot policy FORCED [15:18:27] RECOVERY - haproxy failover on dbproxy1024 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [15:18:31] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [15:18:37] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [15:18:41] RECOVERY - haproxy failover on dbproxy1027 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [15:18:57] PROBLEM - haproxy failover on dbproxy1022 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [15:18:59] PROBLEM - PyBal backends health check on lvs6001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [15:19:27] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:19:42] (03PS1) 10Muehlenhoff: java: Remove now obsolete warning [puppet] - 10https://gerrit.wikimedia.org/r/952879 [15:19:55] PROBLEM - pybal on lvs6001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal 
https://wikitech.wikimedia.org/wiki/PyBal [15:20:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T344589)', diff saved to https://phabricator.wikimedia.org/P51669 and previous config saved to /var/cache/conftool/dbconfig/20230828-152004-ladsgroup.json [15:21:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T344589)', diff saved to https://phabricator.wikimedia.org/P51670 and previous config saved to /var/cache/conftool/dbconfig/20230828-152121-ladsgroup.json [15:21:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P51671 and previous config saved to /var/cache/conftool/dbconfig/20230828-152128-ladsgroup.json [15:22:26] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T344872 (10Papaul) 05Open→03Resolved Closing this since it has been fixed in https://phabricator.wikimedia.org/T344110 [15:23:09] PROBLEM - PyBal connections to etcd on lvs6001 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [15:25:21] RECOVERY - haproxy failover on dbproxy1026 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [15:26:03] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [15:26:05] RECOVERY - haproxy failover on dbproxy1022 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [15:28:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T343718)', diff saved to https://phabricator.wikimedia.org/P51672 and previous config saved to /var/cache/conftool/dbconfig/20230828-152820-ladsgroup.json [15:28:26] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [15:29:25] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance es1027 (T344589)', diff saved to https://phabricator.wikimedia.org/P51673 and previous config saved to /var/cache/conftool/dbconfig/20230828-152925-ladsgroup.json [15:29:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1029.eqiad.wmnet with reason: Maintenance [15:29:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1029.eqiad.wmnet with reason: Maintenance [15:29:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1029 (T344589)', diff saved to https://phabricator.wikimedia.org/P51674 and previous config saved to /var/cache/conftool/dbconfig/20230828-152948-ladsgroup.json [15:29:55] (03CR) 10Abijeet Patro: [V: 03+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/952830 (owner: 10L10n-bot) [15:30:05] jan_drewniak: Dear deployers, time to do the Wikimedia Portals Update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230828T1530). 
[15:32:03] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs6001.drmrs.wmnet [15:34:40] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs6001.drmrs.wmnet [15:34:47] PROBLEM - PyBal backends health check on lvs6001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [15:34:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1029 (T344589)', diff saved to https://phabricator.wikimedia.org/P51675 and previous config saved to /var/cache/conftool/dbconfig/20230828-153447-ladsgroup.json [15:35:01] !log enable puppet and start pybal on lvs6001 (T344587) [15:35:03] (03PS1) 10Muehlenhoff: SSH cloud access: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/952880 [15:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P51676 and previous config saved to /var/cache/conftool/dbconfig/20230828-153510-ladsgroup.json [15:35:15] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:35:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host moss-be2003.mgmt.codfw.wmnet with reboot policy FORCED [15:35:47] RECOVERY - pybal on lvs6001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [15:36:15] RECOVERY - PyBal backends health check on lvs6001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:36:20] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['moss-be2003'] [15:36:27] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P51677 and previous config saved to /var/cache/conftool/dbconfig/20230828-153627-ladsgroup.json [15:36:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T343718)', diff saved to https://phabricator.wikimedia.org/P51678 and previous config saved to /var/cache/conftool/dbconfig/20230828-153634-ladsgroup.json [15:36:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance [15:36:40] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [15:36:41] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['moss-be2003'] [15:36:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance [15:36:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T343718)', diff saved to https://phabricator.wikimedia.org/P51679 and previous config saved to /var/cache/conftool/dbconfig/20230828-153655-ladsgroup.json [15:37:18] (03PS1) 10Muehlenhoff: mariadb::packages_wmf: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/952881 [15:38:11] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10Jhancock.wm) [15:39:23] RECOVERY - PyBal connections to etcd on lvs6001 is OK: OK: 12 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [15:42:35] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, few typos inline" [puppet] - 10https://gerrit.wikimedia.org/r/952401 (owner: 10Jbond) [15:42:53] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952881 (owner: 10Muehlenhoff) [15:43:01] 
RECOVERY - haproxy failover on dbproxy1025 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [15:43:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P51680 and previous config saved to /var/cache/conftool/dbconfig/20230828-154327-ladsgroup.json [15:43:47] RECOVERY - haproxy failover on dbproxy1023 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [15:47:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:48:56] (03PS1) 10Marostegui: mariadb: Comments to clarify db1118 situation [puppet] - 10https://gerrit.wikimedia.org/r/952885 [15:49:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1029', diff saved to https://phabricator.wikimedia.org/P51681 and previous config saved to /var/cache/conftool/dbconfig/20230828-154953-ladsgroup.json [15:50:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P51682 and previous config saved to /var/cache/conftool/dbconfig/20230828-155016-ladsgroup.json [15:51:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P51683 and previous config saved to /var/cache/conftool/dbconfig/20230828-155133-ladsgroup.json [15:52:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency 
[15:56:26] (03CR) 10JMeybohm: Fix wikifunctions orchestrator not using the service mesh (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/952782 (https://phabricator.wikimedia.org/T344998) (owner: 10JMeybohm) [15:58:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T343718)', diff saved to https://phabricator.wikimedia.org/P51684 and previous config saved to /var/cache/conftool/dbconfig/20230828-155830-ladsgroup.json [15:58:36] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [15:58:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P51685 and previous config saved to /var/cache/conftool/dbconfig/20230828-155839-ladsgroup.json [15:59:09] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/952879 (owner: 10Muehlenhoff) [16:01:42] (03PS1) 10BCornwall: admin: Add bash, tmux, vim dotfiles for brett [puppet] - 10https://gerrit.wikimedia.org/r/952887 [16:04:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1029', diff saved to https://phabricator.wikimedia.org/P51686 and previous config saved to /var/cache/conftool/dbconfig/20230828-160459-ladsgroup.json [16:05:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T344589)', diff saved to https://phabricator.wikimedia.org/P51687 and previous config saved to /var/cache/conftool/dbconfig/20230828-160522-ladsgroup.json [16:05:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [16:05:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [16:05:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T344589)', diff saved to 
https://phabricator.wikimedia.org/P51688 and previous config saved to /var/cache/conftool/dbconfig/20230828-160546-ladsgroup.json [16:06:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T344589)', diff saved to https://phabricator.wikimedia.org/P51689 and previous config saved to /var/cache/conftool/dbconfig/20230828-160639-ladsgroup.json [16:06:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [16:06:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [16:07:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [16:07:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [16:07:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T344589)', diff saved to https://phabricator.wikimedia.org/P51690 and previous config saved to /var/cache/conftool/dbconfig/20230828-160709-ladsgroup.json [16:08:34] (03PS1) 10Jbond: ferm: add ensure support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) [16:10:12] (03CR) 10Jbond: "see comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/952862 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [16:10:46] (03CR) 10CI reject: [V: 04-1] ferm: add ensure support to the ferm class [puppet] - 10https://gerrit.wikimedia.org/r/952889 (https://phabricator.wikimedia.org/T336497) (owner: 10Jbond) [16:10:55] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/952880 (owner: 10Muehlenhoff) [16:11:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T344589)', diff saved to 
https://phabricator.wikimedia.org/P51691 and previous config saved to /var/cache/conftool/dbconfig/20230828-161147-ladsgroup.json [16:13:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T343718)', diff saved to https://phabricator.wikimedia.org/P51692 and previous config saved to /var/cache/conftool/dbconfig/20230828-161306-ladsgroup.json [16:13:13] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [16:13:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T344589)', diff saved to https://phabricator.wikimedia.org/P51693 and previous config saved to /var/cache/conftool/dbconfig/20230828-161320-ladsgroup.json [16:13:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P51694 and previous config saved to /var/cache/conftool/dbconfig/20230828-161337-ladsgroup.json [16:13:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T343718)', diff saved to https://phabricator.wikimedia.org/P51695 and previous config saved to /var/cache/conftool/dbconfig/20230828-161345-ladsgroup.json [16:13:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [16:14:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [16:14:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T343718)', diff saved to https://phabricator.wikimedia.org/P51696 and previous config saved to /var/cache/conftool/dbconfig/20230828-161406-ladsgroup.json [16:16:40] (03CR) 10Jbond: [C: 03+1] "fyi feel free to +2 and self merge user files in your own home dir" [puppet] - 10https://gerrit.wikimedia.org/r/952887 (owner: 10BCornwall) [16:17:16] (03CR) 10BCornwall: [C: 03+2] admin: Add bash, 
tmux, vim dotfiles for brett [puppet] - 10https://gerrit.wikimedia.org/r/952887 (owner: 10BCornwall) [16:20:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1029 (T344589)', diff saved to https://phabricator.wikimedia.org/P51697 and previous config saved to /var/cache/conftool/dbconfig/20230828-162005-ladsgroup.json [16:26:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P51698 and previous config saved to /var/cache/conftool/dbconfig/20230828-162654-ladsgroup.json [16:28:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P51699 and previous config saved to /var/cache/conftool/dbconfig/20230828-162812-ladsgroup.json [16:28:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P51700 and previous config saved to /var/cache/conftool/dbconfig/20230828-162826-ladsgroup.json [16:28:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P51701 and previous config saved to /var/cache/conftool/dbconfig/20230828-162843-ladsgroup.json [16:30:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be2003.codfw.wmnet with OS bullseye [16:30:34] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye [16:31:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [16:32:38] (KubernetesAPILatency) firing: High 
Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:32:48] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [16:32:50] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [16:36:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [16:36:42] (03PS1) 10Bking: flink-app: Increment chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/952891 (https://phabricator.wikimedia.org/T344614) [16:37:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:42:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P51702 and previous config saved to /var/cache/conftool/dbconfig/20230828-164200-ladsgroup.json [16:43:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P51703 and previous config saved to /var/cache/conftool/dbconfig/20230828-164318-ladsgroup.json [16:43:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P51704 and previous config saved to /var/cache/conftool/dbconfig/20230828-164332-ladsgroup.json [16:43:46] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/952894 [16:43:50] 
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T343718)', diff saved to https://phabricator.wikimedia.org/P51705 and previous config saved to /var/cache/conftool/dbconfig/20230828-164349-ladsgroup.json [16:43:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance [16:43:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance [16:43:55] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [16:44:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T343718)', diff saved to https://phabricator.wikimedia.org/P51706 and previous config saved to /var/cache/conftool/dbconfig/20230828-164359-ladsgroup.json [16:46:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [16:51:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T343718)', diff saved to https://phabricator.wikimedia.org/P51707 and previous config saved to /var/cache/conftool/dbconfig/20230828-165131-ladsgroup.json [16:51:37] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [16:56:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [16:57:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T344589)', diff saved to https://phabricator.wikimedia.org/P51708 and previous config saved to /var/cache/conftool/dbconfig/20230828-165706-ladsgroup.json [16:57:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: 
Maintenance [16:57:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance [16:57:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T344589)', diff saved to https://phabricator.wikimedia.org/P51709 and previous config saved to /var/cache/conftool/dbconfig/20230828-165730-ladsgroup.json [16:57:39] (03PS1) 10Volans: sre.hosts: remove stretch from list of OSes [cookbooks] - 10https://gerrit.wikimedia.org/r/952897 [16:58:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T343718)', diff saved to https://phabricator.wikimedia.org/P51710 and previous config saved to /var/cache/conftool/dbconfig/20230828-165824-ladsgroup.json [16:58:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance [16:58:30] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [16:58:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T344589)', diff saved to https://phabricator.wikimedia.org/P51711 and previous config saved to /var/cache/conftool/dbconfig/20230828-165839-ladsgroup.json [16:58:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance [16:58:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1186 (T343718)', diff saved to https://phabricator.wikimedia.org/P51712 and previous config saved to /var/cache/conftool/dbconfig/20230828-165846-ladsgroup.json [16:58:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [16:59:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [16:59:07] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Depooling db2180 (T344589)', diff saved to https://phabricator.wikimedia.org/P51713 and previous config saved to /var/cache/conftool/dbconfig/20230828-165906-ladsgroup.json [17:00:06] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230828T1700) [17:00:06] ryankemper: #bothumor My software never has bugs. It just develops random features. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230828T1700). [17:00:16] !log bking@cumin1001 conftool action : set/pooled=no; selector: name=wdqs1005.eqiad.wmnet [17:00:35] !log bking@cumin1001 depool wdqs1005 for decom T344198 [17:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:43] T344198: Decommission wdqs10[03-05] - https://phabricator.wikimedia.org/T344198 [17:04:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T344589)', diff saved to https://phabricator.wikimedia.org/P51714 and previous config saved to /var/cache/conftool/dbconfig/20230828-170435-ladsgroup.json [17:06:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T344589)', diff saved to https://phabricator.wikimedia.org/P51715 and previous config saved to /var/cache/conftool/dbconfig/20230828-170632-ladsgroup.json [17:06:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P51716 and previous config saved to /var/cache/conftool/dbconfig/20230828-170637-ladsgroup.json [17:10:39] (03PS1) 10JMeybohm: envoy: Create /var/run/envoy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952900 [17:11:36] (03PS1) 10JMeybohm: mesh.configuration: Add new minor version 1.4.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/952901 [17:11:38] (03PS1) 10JMeybohm: mesh.configuration: Bind the admin interface to 
a socket instead of tcp [deployment-charts] - 10https://gerrit.wikimedia.org/r/952902 [17:13:19] 10SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10Ladsgroup) I'm the clinic duty this week. I will take over from here. Let me double check the ssh key. [17:13:20] (03PS2) 10JMeybohm: envoy: Create /var/run/envoy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952900 (https://phabricator.wikimedia.org/T343709) [17:16:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [17:17:42] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-be2003.codfw.wmnet with OS bullseye [17:17:54] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye executed with... 
[17:19:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P51717 and previous config saved to /var/cache/conftool/dbconfig/20230828-171942-ladsgroup.json [17:21:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [17:21:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P51718 and previous config saved to /var/cache/conftool/dbconfig/20230828-172138-ladsgroup.json [17:21:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P51719 and previous config saved to /var/cache/conftool/dbconfig/20230828-172143-ladsgroup.json [17:24:42] jouncebot: nowandnext [17:24:42] For the next 0 hour(s) and 35 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230828T1700) [17:24:42] For the next 0 hour(s) and 5 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230828T1700) [17:24:42] In 2 hour(s) and 35 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230828T2000) [17:25:01] is anyone deploying anything in the infra window or can I quickly do a config patch? 
[17:25:17] (PS2) Majavah: Set OATHAuth multiple devices WRITE_BOTH for all privates [mediawiki-config] - https://gerrit.wikimedia.org/r/952184 (https://phabricator.wikimedia.org/T242031) [17:25:19] (PS2) Majavah: Set OATHAuth multiple devices READ_NEW for checkuser, techconduct [mediawiki-config] - https://gerrit.wikimedia.org/r/952185 (https://phabricator.wikimedia.org/T242031) [17:26:30] (CR) JHathaway: [C: +1] "looks good" [puppet] - https://gerrit.wikimedia.org/r/952879 (owner: Muehlenhoff) [17:28:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T343718)', diff saved to https://phabricator.wikimedia.org/P51720 and previous config saved to /var/cache/conftool/dbconfig/20230828-172858-ladsgroup.json [17:29:04] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [17:29:44] (PS4) Slyngshede: Allow Unix shell account to be specified. [software/bitu] - https://gerrit.wikimedia.org/r/952402 [17:29:57] (CR) Slyngshede: Allow Unix shell account to be specified. (5 comments) [software/bitu] - https://gerrit.wikimedia.org/r/952402 (owner: Slyngshede) [17:30:48] (CR) Slyngshede: "Needs to rebase once the Unix account name patch is merged. 
See: https://gerrit.wikimedia.org/r/c/operations/software/bitu/+/952402" [software/bitu] - 10https://gerrit.wikimedia.org/r/952658 (owner: 10Slyngshede) [17:31:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952184 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [17:31:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952185 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [17:31:52] (03Merged) 10jenkins-bot: Set OATHAuth multiple devices WRITE_BOTH for all privates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952184 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [17:31:55] (03Merged) 10jenkins-bot: Set OATHAuth multiple devices READ_NEW for checkuser, techconduct [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952185 (https://phabricator.wikimedia.org/T242031) (owner: 10Majavah) [17:32:10] !log taavi@deploy1002 Started scap: Backport for [[gerrit:952184|Set OATHAuth multiple devices WRITE_BOTH for all privates (T242031)]], [[gerrit:952185|Set OATHAuth multiple devices READ_NEW for checkuser, techconduct (T242031)]] [17:32:15] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [17:33:42] !log taavi@deploy1002 taavi: Backport for [[gerrit:952184|Set OATHAuth multiple devices WRITE_BOTH for all privates (T242031)]], [[gerrit:952185|Set OATHAuth multiple devices READ_NEW for checkuser, techconduct (T242031)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD opti [17:33:42] on) [17:34:21] !log taavi@deploy1002 taavi: Continuing with sync [17:34:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff 
saved to https://phabricator.wikimedia.org/P51721 and previous config saved to /var/cache/conftool/dbconfig/20230828-173448-ladsgroup.json [17:35:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T343718)', diff saved to https://phabricator.wikimedia.org/P51722 and previous config saved to /var/cache/conftool/dbconfig/20230828-173506-ladsgroup.json [17:35:12] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [17:36:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P51723 and previous config saved to /var/cache/conftool/dbconfig/20230828-173645-ladsgroup.json [17:36:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T343718)', diff saved to https://phabricator.wikimedia.org/P51724 and previous config saved to /var/cache/conftool/dbconfig/20230828-173650-ladsgroup.json [17:36:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance [17:37:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance [17:37:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [17:37:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [17:37:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2173 (T343718)', diff saved to https://phabricator.wikimedia.org/P51725 and previous config saved to /var/cache/conftool/dbconfig/20230828-173726-ladsgroup.json [17:37:45] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:39:11] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:39:51] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:952184|Set OATHAuth multiple devices WRITE_BOTH for all privates (T242031)]], [[gerrit:952185|Set OATHAuth multiple devices READ_NEW for checkuser, techconduct (T242031)]] (duration: 07m 41s) [17:39:57] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [17:40:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:43:50] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [17:44:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P51726 and previous config saved to /var/cache/conftool/dbconfig/20230828-174404-ladsgroup.json [17:45:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:46:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be2003.codfw.wmnet with OS bullseye [17:46:53] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye [17:49:55] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T344589)', diff saved to https://phabricator.wikimedia.org/P51727 and previous config saved to /var/cache/conftool/dbconfig/20230828-174954-ladsgroup.json [17:50:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance [17:50:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P51728 and previous config saved to /var/cache/conftool/dbconfig/20230828-175013-ladsgroup.json [17:50:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance [17:50:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1224 (T344589)', diff saved to https://phabricator.wikimedia.org/P51729 and previous config saved to /var/cache/conftool/dbconfig/20230828-175019-ladsgroup.json [17:51:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T344589)', diff saved to https://phabricator.wikimedia.org/P51730 and previous config saved to /var/cache/conftool/dbconfig/20230828-175151-ladsgroup.json [17:56:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T344589)', diff saved to https://phabricator.wikimedia.org/P51731 and previous config saved to /var/cache/conftool/dbconfig/20230828-175630-ladsgroup.json [17:59:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P51732 and previous config saved to /var/cache/conftool/dbconfig/20230828-175911-ladsgroup.json [18:01:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [18:01:43] (03CR) 10Krinkle: Beta: Clean puppet 
cherry-picks (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/310719 (https://phabricator.wikimedia.org/T135427) (owner: 10Thcipriani) [18:05:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P51733 and previous config saved to /var/cache/conftool/dbconfig/20230828-180519-ladsgroup.json [18:06:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [18:07:03] (03CR) 10Gmodena: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952891 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [18:11:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P51734 and previous config saved to /var/cache/conftool/dbconfig/20230828-181136-ladsgroup.json [18:13:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T343718)', diff saved to https://phabricator.wikimedia.org/P51735 and previous config saved to /var/cache/conftool/dbconfig/20230828-181344-ladsgroup.json [18:13:50] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [18:14:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T343718)', diff saved to https://phabricator.wikimedia.org/P51736 and previous config saved to /var/cache/conftool/dbconfig/20230828-181417-ladsgroup.json [18:14:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance [18:14:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2110.codfw.wmnet with reason: Maintenance [18:14:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2110 (T343718)', diff saved to 
https://phabricator.wikimedia.org/P51737 and previous config saved to /var/cache/conftool/dbconfig/20230828-181427-ladsgroup.json [18:18:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:18:54] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:20:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T343718)', diff saved to https://phabricator.wikimedia.org/P51738 and previous config saved to /var/cache/conftool/dbconfig/20230828-182025-ladsgroup.json [18:20:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance [18:20:31] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [18:20:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance [18:20:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:20:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:21:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1196 (T343718)', diff saved to https://phabricator.wikimedia.org/P51739 and previous config saved to /var/cache/conftool/dbconfig/20230828-182104-ladsgroup.json [18:23:38] 
(KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:26:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P51740 and previous config saved to /var/cache/conftool/dbconfig/20230828-182642-ladsgroup.json [18:28:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P51741 and previous config saved to /var/cache/conftool/dbconfig/20230828-182851-ladsgroup.json [18:34:12] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-be2003.codfw.wmnet with OS bullseye [18:34:17] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye executed with... 
[18:41:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T344589)', diff saved to https://phabricator.wikimedia.org/P51742 and previous config saved to /var/cache/conftool/dbconfig/20230828-184149-ladsgroup.json [18:43:51] (03PS1) 10Andrew Bogott: wmf_sink: catch ssl errors when talking to the proxy server [puppet] - 10https://gerrit.wikimedia.org/r/952919 (https://phabricator.wikimedia.org/T345103) [18:43:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P51743 and previous config saved to /var/cache/conftool/dbconfig/20230828-184357-ladsgroup.json [18:46:23] (03CR) 10Andrew Bogott: [C: 04-1] "Taavi and I both have a mild preference for not handling this exception so we notice that it's broken" [puppet] - 10https://gerrit.wikimedia.org/r/952919 (https://phabricator.wikimedia.org/T345103) (owner: 10Andrew Bogott) [18:49:26] (03CR) 10Bking: [C: 03+2] flink-app: Increment chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/952891 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [18:49:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T343718)', diff saved to https://phabricator.wikimedia.org/P51744 and previous config saved to /var/cache/conftool/dbconfig/20230828-184943-ladsgroup.json [18:49:50] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [18:51:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [18:52:55] !log bking@cumin1001 START - Cookbook sre.hosts.reboot-single for host wdqs1004.eqiad.wmnet [18:53:18] !log bking@cumin1001 conftool action : set/pooled=no; selector: name=wdqs1004.eqiad.wmnet [18:53:45] PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: 
CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [18:55:17] !log bking@cumin1001 depool wdqs1004 for firmware update [18:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:22] (03Merged) 10jenkins-bot: flink-app: Increment chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/952891 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [18:57:37] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [18:57:39] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [18:59:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T343718)', diff saved to https://phabricator.wikimedia.org/P51745 and previous config saved to /var/cache/conftool/dbconfig/20230828-185903-ladsgroup.json [18:59:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [18:59:09] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [18:59:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [18:59:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T343718)', diff saved to https://phabricator.wikimedia.org/P51746 and previous config saved to /var/cache/conftool/dbconfig/20230828-185924-ladsgroup.json [18:59:58] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs1004.eqiad.wmnet [19:01:01] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1004.eqiad.wmnet with OS bullseye [19:03:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on 
k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:04:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P51747 and previous config saved to /var/cache/conftool/dbconfig/20230828-190449-ladsgroup.json [19:05:32] (03CR) 10Gmodena: Increase the kafka-jumbo maximum message size to 10 MB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952160 (https://phabricator.wikimedia.org/T307959) (owner: 10Btullis) [19:08:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:09:38] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk2001.codfw.wmnet with OS bookworm [19:11:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [19:13:04] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1004.eqiad.wmnet with reason: host reimage [19:15:33] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1004.eqiad.wmnet with reason: host reimage [19:16:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [19:18:31] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [19:18:35] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [19:18:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T343718)', diff saved to https://phabricator.wikimedia.org/P51748 and previous config saved to /var/cache/conftool/dbconfig/20230828-191836-ladsgroup.json [19:18:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:18:44] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [19:19:36] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [19:19:38] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [19:19:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P51749 and previous config saved to /var/cache/conftool/dbconfig/20230828-191955-ladsgroup.json [19:21:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [19:23:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:25:26] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [19:25:28] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [19:26:56] (03PS1) 10Jgreen: Remove deprecated hosts 
frdm1001.frack.eqiad.wmnet and frlog1001.frack.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/952923 (https://phabricator.wikimedia.org/T317443) [19:29:39] RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [19:33:30] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [19:33:32] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [19:33:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:33:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P51750 and previous config saved to /var/cache/conftool/dbconfig/20230828-193342-ladsgroup.json [19:35:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T343718)', diff saved to https://phabricator.wikimedia.org/P51751 and previous config saved to /var/cache/conftool/dbconfig/20230828-193501-ladsgroup.json [19:35:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance [19:35:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance [19:35:07] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [19:35:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T343718)', diff saved to 
https://phabricator.wikimedia.org/P51752 and previous config saved to /var/cache/conftool/dbconfig/20230828-193511-ladsgroup.json [19:36:05] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1004.eqiad.wmnet with OS bullseye [19:36:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T343718)', diff saved to https://phabricator.wikimedia.org/P51753 and previous config saved to /var/cache/conftool/dbconfig/20230828-193626-ladsgroup.json [19:36:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [19:38:13] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host flink-zk2001.codfw.wmnet with OS bookworm [19:38:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:38:40] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [19:41:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [19:46:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [19:48:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P51754 and previous config saved to /var/cache/conftool/dbconfig/20230828-194848-ladsgroup.json [19:51:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - 
https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [19:51:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P51755 and previous config saved to /var/cache/conftool/dbconfig/20230828-195132-ladsgroup.json [19:51:39] (03PS1) 10Neil Shah-Quinn (WMF): Add comment about mirroring of wgMobileUrlTemplate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952925 (https://phabricator.wikimedia.org/T344185) [19:56:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [19:57:07] (03PS1) 10Bking: flink-app: change HA config for test [deployment-charts] - 10https://gerrit.wikimedia.org/r/952927 (https://phabricator.wikimedia.org/T344614) [19:59:40] (03PS2) 10Bking: flink-app: change HA config for test [deployment-charts] - 10https://gerrit.wikimedia.org/r/952927 (https://phabricator.wikimedia.org/T344614) [20:00:06] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230828T2000). [20:00:06] No Gerrit patches in the queue for this window AFAICS. [20:00:39] I actually have a comment-only patch for the backport, if anyone is available! [20:01:14] I just started looking for a window, and lo, there is one RIGHT NOW [20:02:21] comment-only? [20:02:32] oh, you mean changing a code comment? 
[20:02:37] please add it to the calendar regardless [20:02:52] Yup, I'm working on it right now [20:03:29] taavi: okay, done [20:03:37] (03CR) 10Bking: "Logstash has the errors if you want to see them: https://logstash.wikimedia.org/goto/bd5782a7ccc33102e7c22a2f6535c907" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952927 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [20:03:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:03:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T343718)', diff saved to https://phabricator.wikimedia.org/P51756 and previous config saved to /var/cache/conftool/dbconfig/20230828-200354-ladsgroup.json [20:03:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance [20:04:02] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [20:04:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance [20:04:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1206 (T343718)', diff saved to https://phabricator.wikimedia.org/P51757 and previous config saved to /var/cache/conftool/dbconfig/20230828-200415-ladsgroup.json [20:04:20] (03CR) 10Dwisehaupt: [C: 03+2] "Looks right. Shipit." 
[dns] - 10https://gerrit.wikimedia.org/r/952923 (https://phabricator.wikimedia.org/T317443) (owner: 10Jgreen) [20:05:13] nsq64: I'm pretty sure that comment is wrong (and has been for a while), as wikitech is definitely behind varnish and the rest of the caching layer [20:06:04] I would also appreciate a +1 from someone in traffic for adding a comment about a traffic-specific workflow [20:06:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [20:06:33] taavi: re: wikitech, that may be true; I just updated the comment to be clearer. [20:06:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P51758 and previous config saved to /var/cache/conftool/dbconfig/20230828-200639-ladsgroup.json [20:06:59] taavi: +1 on what? [20:07:05] bblack: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/952925/ [20:08:30] the mobile URL rules there, I assume, are for transforming desktop->mobile in outputs? [20:08:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:10:49] hi, back (sorry, don't usually use IRC). Apologies if I missed something. bblack: The rules in mediawiki-config define the URL where the MobileFrontend site should be served. The rules in Varnish define the URL redirection that should be applied for mobile devices trying to access the desktop version. [20:11:52] currently, the workflow is: (1) someone updates the rules in mediawiki-config and (2) at some point someone realizes that the rules should also be updated in Varnish. This just happened for Wikifunctions. 
[20:13:02] I just added another mirrored copy of the rules, to help data scientists translate between mobile and desktop URLs (https://phabricator.wikimedia.org/T344080). [20:13:07] yeah I was just trying to understand functionally what happens with the rules in wmf-config. Because I don't think they're used to parse incoming traffic, but maybe to generate outputs? [20:13:53] bblack: yes, generating output is my understanding [20:13:59] ok [20:14:25] (03CR) 10Ebernhardson: [C: 03+1] flink-app: change HA config for test [deployment-charts] - 10https://gerrit.wikimedia.org/r/952927 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [20:15:25] is wikifunctions.org using just https://m.wikifunctions.org/ ? (no language subdomains, right?) [20:15:46] just checking, because that's all the TLS cert is set up for [20:15:48] bblack: yup, correct [20:15:57] like Wikidata [20:16:02] ok [20:16:33] wikitech is behind varnish, but we don't process mobile redirects for it [20:17:07] (03PS1) 10Sohom Datta: Allow loading Edit-in-Sequence as a beta feature on Wikisources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952928 (https://phabricator.wikimedia.org/T308098) [20:17:51] (we treat it more like we do all our non-wiki services, e.g. grafana or phab) [20:18:03] bblack: oh, interesting. Does it not mess with the caching that the mobile and desktop versions are served from the same URL? Or does it just not matter because Wikitech is so very small? [20:18:24] is there even a mobile version? UA-detected or something? 
[20:18:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:19:13] if it's detecting UAs to put out alternate mobile content, probably it gets blended up in the caches (whichever variant caches first wins for other users for a while) [20:19:48] bblack: I *think* it's all just ?mobileaction=toggle_view_mobile at wikitech if you want the minerva skin with data munging [20:20:00] bblack: there is a mobile version, if you click to switch in the footer. There's just no mobile redirection. [20:20:07] ah [20:20:31] I wonder if it's cacheable. probably not many of us use it, so it might be a problem flying under the radar all this time. [20:21:01] we do occasionally see cache pollution from it, but most of us who would care are generally logged in on wikitech too [20:21:17] bblack: just FYI since we're on the subject, while digging into this, I did notice that the Varnish mobile redirection misses some tiny sites that do have separate mobile URLs: https://phabricator.wikimedia.org/T344175 [20:21:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T343718)', diff saved to https://phabricator.wikimedia.org/P51759 and previous config saved to /var/cache/conftool/dbconfig/20230828-202145-ladsgroup.json [20:21:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [20:21:56] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [20:22:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [20:22:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T343718)', diff saved to 
https://phabricator.wikimedia.org/P51760 and previous config saved to /var/cache/conftool/dbconfig/20230828-202206-ladsgroup.json [20:23:24] (03CR) 10Bking: [C: 03+2] flink-app: change HA config for test [deployment-charts] - 10https://gerrit.wikimedia.org/r/952927 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [20:23:26] (03PS2) 10Neil Shah-Quinn (WMF): Add comment about mirroring of wgMobileUrlTemplate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952925 (https://phabricator.wikimedia.org/T344185) [20:23:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:24:06] nsq64: yeah this is an interesting topic in general. In the short term, if there are a few that people are complaining about for the redirect, we can add them. [20:24:10] (03Merged) 10jenkins-bot: flink-app: change HA config for test [deployment-charts] - 10https://gerrit.wikimedia.org/r/952927 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [20:24:38] in the long term, someone should really clean up everything about m-dot and how it works across our stack :P [20:25:40] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [20:25:44] bblack: I haven't noticed anyone actually complaining, so I don't think there's any strong reason to address it. It just seemed right to make a task for it :) [20:25:48] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [20:25:48] (also the signalling part: mediawiki doesn't parse the m-dot URLs in requests. 
It relies on varnish parsing them, transforming them back to desktop URLs, then tacking on an "X-Subdomain: M" header) [20:26:21] it would be cleaner all around if MW could parse the mobile URLs for itself directly. [20:26:56] and the mobile UA regex in varnish is ancient and really hasn't been maintained, but by some miracle manages to still be useful [20:27:48] !log clear pre-upgrade aqs snapshots — T339299 [20:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:53] T339299: Upgrade aqs cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339299 [20:28:18] (03CR) 10BBlack: [C: 03+1] Add comment about mirroring of wgMobileUrlTemplate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952925 (https://phabricator.wikimedia.org/T344185) (owner: 10Neil Shah-Quinn (WMF)) [20:28:29] anyways, comments LGTM. everything else is a rabbithole :) [20:29:15] bblack: hah, yeah, maybe someone will have a chance to plumb the rabbithole sometime this decade '=D [20:29:16] thanks for the +1! 
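(Editor's note on the m-dot signalling described at 20:25:48: the sketch below is only a rough illustration of the idea, not the actual Varnish VCL. It assumes two host shapes, `<lang>.m.<project>.<tld>` and a bare `m.<project>.<tld>`, and assumes the bare form maps back to a `www.` desktop host; the function name `to_desktop` is hypothetical. Only the "X-Subdomain: M" header comes from the conversation.)

```python
from urllib.parse import urlsplit, urlunsplit


def to_desktop(url: str) -> tuple[str, dict[str, str]]:
    """Hypothetical helper: map an m-dot URL to its desktop URL.

    Returns the rewritten URL plus any headers the cache layer would
    add for the backend. Ports and userinfo are ignored for brevity.
    """
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    headers: dict[str, str] = {}

    if len(labels) >= 4 and labels[1] == "m":
        # en.m.wikipedia.org -> en.wikipedia.org
        labels = [labels[0]] + labels[2:]
        headers["X-Subdomain"] = "M"
    elif len(labels) >= 3 and labels[0] == "m":
        # m.wikifunctions.org -> www.wikifunctions.org
        # (the "www" desktop host is an assumption of this sketch)
        labels = ["www"] + labels[1:]
        headers["X-Subdomain"] = "M"

    desktop = urlunsplit(parts._replace(netloc=".".join(labels)))
    return desktop, headers
```

(A request for a desktop host passes through unchanged with no extra header, so the backend only has to check for the header rather than parse mobile hostnames itself.)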
[20:29:46] np [20:29:51] taavi: Traffic has given a +1 so it's okay to deploy if you're available [20:32:08] (03PS1) 10Bking: flink-app: remove "high-availability.cluster-id" config key [deployment-charts] - 10https://gerrit.wikimedia.org/r/952929 (https://phabricator.wikimedia.org/T344614) [20:33:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:38:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:41:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T343718)', diff saved to https://phabricator.wikimedia.org/P51761 and previous config saved to /var/cache/conftool/dbconfig/20230828-204119-ladsgroup.json [20:41:25] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [20:41:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T343718)', diff saved to https://phabricator.wikimedia.org/P51762 and previous config saved to /var/cache/conftool/dbconfig/20230828-204153-ladsgroup.json [20:44:50] taavi: FYI, I just moved the patch to a future window [20:48:36] (03CR) 10TChin: [C: 03+1] flink-app: remove "high-availability.cluster-id" config key [deployment-charts] - 10https://gerrit.wikimedia.org/r/952929 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [20:48:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:53:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:56:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P51763 and previous config saved to /var/cache/conftool/dbconfig/20230828-205625-ladsgroup.json [20:56:47] (03CR) 10Bking: [C: 03+2] flink-app: remove "high-availability.cluster-id" config key [deployment-charts] - 10https://gerrit.wikimedia.org/r/952929 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [20:57:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P51764 and previous config saved to /var/cache/conftool/dbconfig/20230828-205700-ladsgroup.json [20:57:33] (03Merged) 10jenkins-bot: flink-app: remove "high-availability.cluster-id" config key [deployment-charts] - 10https://gerrit.wikimedia.org/r/952929 (https://phabricator.wikimedia.org/T344614) (owner: 10Bking) [20:58:09] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [20:58:15] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [20:58:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T343718)', diff saved to https://phabricator.wikimedia.org/P51765 and previous config saved to /var/cache/conftool/dbconfig/20230828-205851-ladsgroup.json [20:58:56] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [21:00:04] Reedy, sbassett, 
Maryum, and manfredi: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230828T2100). [21:01:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [21:06:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [21:11:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P51766 and previous config saved to /var/cache/conftool/dbconfig/20230828-211131-ladsgroup.json [21:12:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P51767 and previous config saved to /var/cache/conftool/dbconfig/20230828-211206-ladsgroup.json [21:13:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P51768 and previous config saved to /var/cache/conftool/dbconfig/20230828-211357-ladsgroup.json [21:20:59] (03PS1) 10HMonroy: wikidiff2: set maxSplitSize = 10 by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952940 (https://phabricator.wikimedia.org/T341754) [21:21:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [21:23:20] (03PS1) 10Bking: wdqs: Add new federation endpoint [puppet] - 10https://gerrit.wikimedia.org/r/952941 (https://phabricator.wikimedia.org/T337296) [21:26:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - 
https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [21:26:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T343718)', diff saved to https://phabricator.wikimedia.org/P51769 and previous config saved to /var/cache/conftool/dbconfig/20230828-212637-ladsgroup.json [21:26:38] (03PS2) 10Ryan Kemper: wdqs: Add new federation endpoint [puppet] - 10https://gerrit.wikimedia.org/r/952941 (https://phabricator.wikimedia.org/T337296) (owner: 10Bking) [21:26:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance [21:26:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance [21:26:43] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [21:26:43] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [21:26:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2119 (T343718)', diff saved to https://phabricator.wikimedia.org/P51770 and previous config saved to /var/cache/conftool/dbconfig/20230828-212647-ladsgroup.json [21:27:05] (03CR) 10Bking: [C: 03+1] wdqs: Add new federation endpoint [puppet] - 10https://gerrit.wikimedia.org/r/952941 (https://phabricator.wikimedia.org/T337296) (owner: 10Bking) [21:27:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T343718)', diff saved to https://phabricator.wikimedia.org/P51771 and previous config saved to /var/cache/conftool/dbconfig/20230828-212712-ladsgroup.json [21:27:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance [21:27:27] !log ladsgroup@cumin1001 END (PASS) 
- Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance [21:27:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1207 (T343718)', diff saved to https://phabricator.wikimedia.org/P51772 and previous config saved to /var/cache/conftool/dbconfig/20230828-212733-ladsgroup.json [21:28:18] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: Add new federation endpoint [puppet] - 10https://gerrit.wikimedia.org/r/952941 (https://phabricator.wikimedia.org/T337296) (owner: 10Bking) [21:29:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P51773 and previous config saved to /var/cache/conftool/dbconfig/20230828-212903-ladsgroup.json [21:32:48] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [21:33:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:34:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:34:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:36:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - 
https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [21:37:12] (SystemdUnitFailed) resolved: (5) wdqs-blazegraph.service Failed on wdqs1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:38:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:40:49] (03PS3) 10Ryan Kemper: wdqs1005: Disable notifications and remove from lvs [puppet] - 10https://gerrit.wikimedia.org/r/952864 (https://phabricator.wikimedia.org/T344198) (owner: 10Bking) [21:42:11] (03PS4) 10Ryan Kemper: wdqs1005: Disable notifications and remove from lvs [puppet] - 10https://gerrit.wikimedia.org/r/952864 (https://phabricator.wikimedia.org/T344198) (owner: 10Bking) [21:42:44] (03PS5) 10Ryan Kemper: wdqs1005: Disable notifications and remove from lvs [puppet] - 10https://gerrit.wikimedia.org/r/952864 (https://phabricator.wikimedia.org/T344198) (owner: 10Bking) [21:44:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T343718)', diff saved to https://phabricator.wikimedia.org/P51774 and previous config saved to /var/cache/conftool/dbconfig/20230828-214409-ladsgroup.json [21:44:15] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [21:45:11] !log bking@cumin1001 conftool action : set/pooled=yes; selector: name=wdqs1004.eqiad.wmnet [21:46:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [21:48:38] (KubernetesAPILatency) 
firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:49:57] (03PS6) 10Ryan Kemper: wdqs1005: Disable notifications and remove from lvs [puppet] - 10https://gerrit.wikimedia.org/r/952864 (https://phabricator.wikimedia.org/T344198) (owner: 10Bking) [21:49:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:49:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:51:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [21:51:57] (03CR) 10Ryan Kemper: [C: 03+2] wdqs1005: Disable notifications and remove from lvs [puppet] - 10https://gerrit.wikimedia.org/r/952864 (https://phabricator.wikimedia.org/T344198) (owner: 10Bking) [21:52:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T343718)', diff saved to https://phabricator.wikimedia.org/P51775 and previous config saved to /var/cache/conftool/dbconfig/20230828-215200-ladsgroup.json [21:52:07] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [21:53:38] 
(KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:56:09] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@16e3dcf]: T337296 restart services for new federation endpoint [21:56:15] T337296: Allow federated queries with the NLG endpoint (data.nlg.gr) - https://phabricator.wikimedia.org/T337296 [21:56:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:57:21] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@16e3dcf]: T337296 restart services for new federation endpoint (duration: 01m 12s) [22:01:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:01:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:06:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:07:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P51776 and previous config saved to /var/cache/conftool/dbconfig/20230828-220706-ladsgroup.json [22:07:48] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db1207 (T343718)', diff saved to https://phabricator.wikimedia.org/P51777 and previous config saved to /var/cache/conftool/dbconfig/20230828-220747-ladsgroup.json [22:07:58] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [22:11:04] (03CR) 10Cwhite: [C: 03+1] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952783 (https://phabricator.wikimedia.org/T344954) (owner: 10Filippo Giunchedi) [22:12:02] 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10decommission-hardware: hw troubleshooting: ipmi down for wdqs1005.eqiad.wmnet - https://phabricator.wikimedia.org/T345081 (10RKemper) a:03Papaul [22:14:23] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart [22:14:23] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99) [22:14:30] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart [22:18:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:18:55] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:21:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:22:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P51778 and previous config saved to /var/cache/conftool/dbconfig/20230828-222212-ladsgroup.json [22:22:54] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P51779 and previous config saved to /var/cache/conftool/dbconfig/20230828-222254-ladsgroup.json [22:23:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:26:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:30:39] 10SRE, 10SRE-swift-storage: Install NVMe SSDs into moss-be200[1|2] & thanos-be200? - https://phabricator.wikimedia.org/T310923 (10Papaul) [22:32:40] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99) [22:33:19] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart [22:33:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:37:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T343718)', diff saved to https://phabricator.wikimedia.org/P51780 and previous config saved to /var/cache/conftool/dbconfig/20230828-223719-ladsgroup.json [22:37:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [22:37:25] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [22:37:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [22:37:40] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T343718)', diff saved to https://phabricator.wikimedia.org/P51781 and previous config saved to /var/cache/conftool/dbconfig/20230828-223740-ladsgroup.json [22:38:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P51782 and previous config saved to /var/cache/conftool/dbconfig/20230828-223800-ladsgroup.json [22:38:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:39:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10decommission-hardware: hw troubleshooting: ipmi down for wdqs1005.eqiad.wmnet - https://phabricator.wikimedia.org/T345081 (10Papaul) @Jclark-ctr @VRiley-WMF can someone please check the mgmt cable for this servers, I can not ping the mgmt IP if the cab... [22:41:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:42:17] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Relabel: puppetmaster1006 to puppetserver1002 - https://phabricator.wikimedia.org/T345080 (10Papaul) a:03VRiley-WMF @VRiley-WMF can you please take care of this when back onsite? once done you can resolve... 
[22:44:15] SRE, ops-eqiad, serviceops: Move eqiad thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343993 (Papaul) a:VRiley-WMF
[22:46:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[22:47:50] SRE, ops-eqiad, Infrastructure-Foundations, netops: Cabling for Eqiad racke E5-7 and F5-7 - https://phabricator.wikimedia.org/T334231 (Papaul) Blocked on https://phabricator.wikimedia.org/T338789
[22:48:58] ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T344976 (Papaul) Open→Resolved a:Papaul Working on this @ https://phabricator.wikimedia.org/T345081
[22:53:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T343718)', diff saved to https://phabricator.wikimedia.org/P51783 and previous config saved to /var/cache/conftool/dbconfig/20230828-225306-ladsgroup.json
[22:53:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance
[22:53:12] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[22:53:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance
[23:11:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[23:11:40] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0)
[23:11:54] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart
[23:11:55] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99)
[23:11:57] (PS2) Krinkle: Raise LoginNotify minimum log level to info [mediawiki-config] - https://gerrit.wikimedia.org/r/952564
(https://phabricator.wikimedia.org/T174200) (owner: Tim Starling)
[23:12:16] (CR) Krinkle: [C: +1] Raise LoginNotify minimum log level to info [mediawiki-config] - https://gerrit.wikimedia.org/r/952564 (https://phabricator.wikimedia.org/T174200) (owner: Tim Starling)
[23:20:36] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.restart
[23:21:29] (RedisMemoryFull) firing: (4) Redis memory full on rdb1012:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[23:23:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1219.eqiad.wmnet with reason: Maintenance
[23:23:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1219.eqiad.wmnet with reason: Maintenance
[23:23:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1219 (T343718)', diff saved to https://phabricator.wikimedia.org/P51784 and previous config saved to /var/cache/conftool/dbconfig/20230828-232344-ladsgroup.json
[23:23:50] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[23:56:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T343718)', diff saved to https://phabricator.wikimedia.org/P51785 and previous config saved to /var/cache/conftool/dbconfig/20230828-235648-ladsgroup.json
[23:56:54] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718