[00:11:42] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [00:16:27] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [00:21:27] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [00:26:27] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [00:38:28] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/951883 [00:38:34] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/951883 (owner: TrainBranchBot) [00:41:27] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [00:46:27] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [00:53:30] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/951883 (owner: TrainBranchBot) [01:21:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [01:26:27] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 
- https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [01:36:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [01:38:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1005:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [02:06:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [02:08:53] (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:16:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [02:33:53] (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:33:53] (SystemdUnitFailed) firing: (3) ipmiseld.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:41:28] 
(RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [02:46:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [02:56:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [03:06:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [03:16:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [03:21:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [03:25:31] PROBLEM - Check systemd state on db1146 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@s4.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:27:35] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:31:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [03:37:19] (PS3) Cwhite: grafana: ensure prometheus/global datasources removed [puppet] - https://gerrit.wikimedia.org/r/951882 (https://phabricator.wikimedia.org/T288196) 
[03:44:07] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1005:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [03:46:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [03:51:25] RECOVERY - Check systemd state on db1146 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:51:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [04:01:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [04:11:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [04:16:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [04:26:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [04:31:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [04:36:28] (RedisMemoryFull) firing: (4) 
Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [04:41:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [04:49:18] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:54:18] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:56:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [05:01:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [05:01:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2140 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P51424 and previous config saved to /var/cache/conftool/dbconfig/20230825-050147-ladsgroup.json [05:11:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [05:16:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - 
https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [05:16:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2140 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P51425 and previous config saved to /var/cache/conftool/dbconfig/20230825-051651-ladsgroup.json [05:24:48] (PS1) Marostegui: dbproxy1016: This host was decommissioned [puppet] - https://gerrit.wikimedia.org/r/952295 [05:26:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [05:26:45] (PS1) Marostegui: wmnet: Failover m3-master [dns] - https://gerrit.wikimedia.org/r/952296 [05:26:55] (CR) Marostegui: [C: +2] dbproxy1016: This host was decommissioned [puppet] - https://gerrit.wikimedia.org/r/952295 (owner: Marostegui) [05:28:20] (CR) Marostegui: [C: +2] wmnet: Failover m3-master [dns] - https://gerrit.wikimedia.org/r/952296 (owner: Marostegui) [05:28:33] !log failover m3-master to dbproxy1020 [05:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:31:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [05:31:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2140 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P51426 and previous config saved to /var/cache/conftool/dbconfig/20230825-053156-ladsgroup.json [05:35:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:36:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [05:47:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2140 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P51427 and previous config saved to /var/cache/conftool/dbconfig/20230825-054701-ladsgroup.json [05:53:15] James_F: did you add wikifunctions to https://meta.wikimedia.org/wiki/Wikimedia_URL_Shortener allow list? [05:53:24] we should list all of these somewhere [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230825T0600) [06:02:29] (PS1) Ladsgroup: Revert "db1178: Disable notification" [puppet] - https://gerrit.wikimedia.org/r/952134 [06:02:36] (PS2) Ladsgroup: Revert "db1178: Disable notification" [puppet] - https://gerrit.wikimedia.org/r/952134 [06:03:09] (CR) Ladsgroup: [C: +2] Revert "db1178: Disable notification" [puppet] - https://gerrit.wikimedia.org/r/952134 (owner: Ladsgroup) [06:06:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [06:16:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [06:29:56] (CR) Tim Starling: [C: -1] "Can you please add a case to the unit test file multi-dc_test.lua?" 
[puppet] - https://gerrit.wikimedia.org/r/952045 (owner: Gergő Tisza) [06:31:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [06:33:53] (JobUnavailable) firing: (3) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:33:53] (SystemdUnitFailed) firing: (3) ipmiseld.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:35:28] (CR) Filippo Giunchedi: [C: +1] "Thank you!" [puppet] - https://gerrit.wikimedia.org/r/951882 (https://phabricator.wikimedia.org/T288196) (owner: Cwhite) [06:36:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [06:41:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [06:49:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org [06:51:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [06:53:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org [06:54:32] dduvall, hello, I'd like to do an emergency deploy for: 951487: ext.uls.interface.js: 
Inline isNamed() method | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/UniversalLanguageSelector/+/951487 -- context is https://phabricator.wikimedia.org/T344635 [06:55:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org [06:56:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [06:56:29] this patch will fix an issue that is causing a large spike in JS errors [06:58:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org [06:59:45] SRE, SRE-tools, Infrastructure-Foundations: sre.ganeti.makevm: Create machine types - https://phabricator.wikimedia.org/T344972 (MoritzMuehlenhoff) [07:00:04] SRE, SRE-tools, Infrastructure-Foundations: sre.ganeti.makevm: Create machine types - https://phabricator.wikimedia.org/T344972 (MoritzMuehlenhoff) p:Triage→Medium a:MoritzMuehlenhoff [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230825T0700) [07:01:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [07:01:34] (PS1) Muehlenhoff: docker build: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/952300 [07:01:37] (PS1) Filippo Giunchedi: sre: move KubernetesAPINotScrapable to k8s-specific alerts [alerts] - https://gerrit.wikimedia.org/r/952301 (https://phabricator.wikimedia.org/T343529) [07:03:36] (CR) JMeybohm: [C: +2] PKI: Rename the aux profile to match the naming scheme [puppet] - https://gerrit.wikimedia.org/r/952243 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm) [07:04:12] (CR) JMeybohm: [C: +2] aux: Rename the aux profile to match the naming scheme [deployment-charts] - https://gerrit.wikimedia.org/r/952246 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm) [07:04:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host bast5004.wikimedia.org [07:04:40] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [07:05:33] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [07:05:38] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[07:06:00] !log installing cups security updates [07:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [07:06:35] (Merged) jenkins-bot: aux: Rename the aux profile to match the naming scheme [deployment-charts] - https://gerrit.wikimedia.org/r/952246 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm) [07:07:13] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [07:07:24] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [07:07:41] (PS1) Muehlenhoff: profile::ci::package_builder::extra_packages: Remove Stretch support [puppet] - https://gerrit.wikimedia.org/r/952302 [07:09:44] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast5004.wikimedia.org - jmm@cumin2002" [07:09:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid2002.codfw.wmnet [07:10:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast5004.wikimedia.org - jmm@cumin2002" [07:10:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:10:29] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache bast5004.wikimedia.org on all recursors [07:10:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast5004.wikimedia.org on all recursors [07:10:59] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast5004.wikimedia.org - jmm@cumin2002" [07:11:06] (CR) Ilias Sarantopoulos: [C: +2] ml-services: 
update enwiki articlequality model [deployment-charts] - https://gerrit.wikimedia.org/r/952230 (https://phabricator.wikimedia.org/T344895) (owner: Ilias Sarantopoulos) [07:11:09] (CR) Physikerwelt: "I would directly jump to 42." [deployment-charts] - https://gerrit.wikimedia.org/r/906694 (owner: PipelineBot) [07:11:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [07:11:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast5004.wikimedia.org - jmm@cumin2002" [07:12:01] (Merged) jenkins-bot: ml-services: update enwiki articlequality model [deployment-charts] - https://gerrit.wikimedia.org/r/952230 (https://phabricator.wikimedia.org/T344895) (owner: Ilias Sarantopoulos) [07:12:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast5004.wikimedia.org with OS bookworm [07:12:11] SRE, Infrastructure-Foundations: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host bast5004.wikimedia.org with OS bookworm [07:13:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid2002.codfw.wmnet [07:15:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid1002.eqiad.wmnet [07:17:30] (CR) Jelto: [C: +2] miscweb: migrate bugzilla image to GitLab [deployment-charts] - https://gerrit.wikimedia.org/r/952228 (https://phabricator.wikimedia.org/T343914) (owner: Jelto) [07:17:51] RECOVERY - Check systemd state on kubestagemaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:18:21] (Merged) jenkins-bot: miscweb: 
migrate bugzilla image to GitLab [deployment-charts] - https://gerrit.wikimedia.org/r/952228 (https://phabricator.wikimedia.org/T343914) (owner: Jelto) [07:19:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid1002.eqiad.wmnet [07:20:30] (CR) Clément Goubert: [V: +1] envoy: Add concurrency control to envoy cmdline [docker-images/production-images] - https://gerrit.wikimedia.org/r/952148 (https://phabricator.wikimedia.org/T344814) (owner: Clément Goubert) [07:20:37] (CR) Clément Goubert: [V: +1 C: +2] envoy: Add concurrency control to envoy cmdline [docker-images/production-images] - https://gerrit.wikimedia.org/r/952148 (https://phabricator.wikimedia.org/T344814) (owner: Clément Goubert) [07:20:40] (CR) Clément Goubert: [V: +2 C: +2] envoy: Add concurrency control to envoy cmdline [docker-images/production-images] - https://gerrit.wikimedia.org/r/952148 (https://phabricator.wikimedia.org/T344814) (owner: Clément Goubert) [07:21:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [07:21:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [07:23:50] (CR) Clément Goubert: [C: +2] mesh: Add concurrency control for envoy workers [deployment-charts] - https://gerrit.wikimedia.org/r/952158 (https://phabricator.wikimedia.org/T344814) (owner: Clément Goubert) [07:24:22] (Merged) jenkins-bot: mesh: Add concurrency control for envoy workers [deployment-charts] - https://gerrit.wikimedia.org/r/952158 (https://phabricator.wikimedia.org/T344814) (owner: Clément Goubert) [07:29:26] !log jelto@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: 
apply [07:30:41] !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [07:31:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [07:36:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [07:37:56] SRE, Infrastructure-Foundations, Patch-For-Review: Integrate Bookworm 12.1 point update - https://phabricator.wikimedia.org/T343121 (MoritzMuehlenhoff) [07:41:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [07:44:41] RECOVERY - Check whether ferm is active by checking the default input chain on kubestagemaster1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [07:49:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard2002.codfw.wmnet [07:50:12] (CR) JMeybohm: [C: +1] sre: move KubernetesAPINotScrapable to k8s-specific alerts [alerts] - https://gerrit.wikimedia.org/r/952301 (https://phabricator.wikimedia.org/T343529) (owner: Filippo Giunchedi) [07:51:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [07:53:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2002.codfw.wmnet [07:54:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard1002.eqiad.wmnet [07:58:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1002.eqiad.wmnet [08:01:28] 
(RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [08:03:48] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [08:03:54] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [08:04:36] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host bast5004.wikimedia.org with OS bookworm [08:04:36] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host bast5004.wikimedia.org [08:04:40] SRE, Infrastructure-Foundations: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host bast5004.wikimedia.org with OS bookworm executed with errors: - bast5004 (**FAIL**) - Removed from Puppet... [08:04:46] (PS4) Clément Goubert: mediawiki: Remove limits for tls-proxy container [deployment-charts] - https://gerrit.wikimedia.org/r/952159 (https://phabricator.wikimedia.org/T344814) [08:04:58] SRE, serviceops, good first task: Upgrade all deployment charts to use the latest version of common_templates - https://phabricator.wikimedia.org/T292390 (Joe) Open→Resolved We've since moved to using modules. 
[08:09:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host bast6003.wikimedia.org [08:09:31] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:13:16] ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T344976 (phaultfinder) [08:16:53] !log disabling puppet on acme-chief clients prior to acmechief2001 reboot [08:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:21] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host acmechief2001.codfw.wmnet [08:20:03] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast6003.wikimedia.org - jmm@cumin2002" [08:21:35] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief2001.codfw.wmnet [08:22:39] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on acmechief2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [08:22:46] liar :) [08:23:02] that's some serious lag [08:23:26] !log re-enabling puppet on acme-chief clients [08:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [08:27:39] (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on acmechief2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [08:32:50] (PS1) JMeybohm: PKI: Rename aux key to match the naming scheme of everything else [labs/private] - https://gerrit.wikimedia.org/r/952309 (https://phabricator.wikimedia.org/T344253) [08:33:15] !log stopping puppet and pybal on lvs2014 for reboot (T344587) [08:33:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:24] (PS5) Clément Goubert: mediawiki: Remove limits for tls-proxy container [deployment-charts] - https://gerrit.wikimedia.org/r/952159 (https://phabricator.wikimedia.org/T344814) [08:33:28] (CR) JMeybohm: [V: +2 C: +2] PKI: Rename aux key to match the naming scheme of everything else [labs/private] - https://gerrit.wikimedia.org/r/952309 (https://phabricator.wikimedia.org/T344253) (owner: JMeybohm) [08:35:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast6003.wikimedia.org - jmm@cumin2002" [08:35:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:35:13] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache bast6003.wikimedia.org on all recursors [08:35:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast6003.wikimedia.org on all recursors [08:35:39] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast6003.wikimedia.org - jmm@cumin2002" [08:36:10] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:36:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast6003.wikimedia.org - jmm@cumin2002" [08:36:26] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs2014.codfw.wmnet [08:39:20] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2014.codfw.wmnet [08:39:32] (PS2) Clément Goubert: mediawiki: Generalize tls-proxy limits removal [deployment-charts] - https://gerrit.wikimedia.org/r/952171 
(https://phabricator.wikimedia.org/T344814) [08:40:06] !log [08:40:06] fabfur: Message missing. Nothing logged. [08:40:21] !log started puppet and pybal on lvs2014 (T344587) [08:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:44] (CR) JMeybohm: [C: +1] mediawiki: Remove limits for tls-proxy container [deployment-charts] - https://gerrit.wikimedia.org/r/952159 (https://phabricator.wikimedia.org/T344814) (owner: Clément Goubert) [08:41:09] (CR) Clément Goubert: [C: +2] mediawiki: Remove limits for tls-proxy container [deployment-charts] - https://gerrit.wikimedia.org/r/952159 (https://phabricator.wikimedia.org/T344814) (owner: Clément Goubert) [08:41:56] !log jnuche@deploy1002 Installing scap version "4.58.0" for 1 hosts [08:42:07] (Merged) jenkins-bot: mediawiki: Remove limits for tls-proxy container [deployment-charts] - https://gerrit.wikimedia.org/r/952159 (https://phabricator.wikimedia.org/T344814) (owner: Clément Goubert) [08:42:09] !log jnuche@deploy1002 Installation of scap version "4.58.0" completed for 1 hosts [08:43:46] jouncebot: nowandnext [08:43:46] For the next 22 hour(s) and 16 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230825T0700) [08:43:46] In 22 hour(s) and 16 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230826T0700) [08:43:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast6003.wikimedia.org with OS bookworm [08:44:00] 10SRE, 10Infrastructure-Foundations: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host bast6003.wikimedia.org with OS bookworm [08:44:25] !log mw-debug: Remove limits for tls-proxy container - T344814 [08:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:29] T344814: mw-on-k8s tls-proxy container CPU throttling at low average load - https://phabricator.wikimedia.org/T344814 [08:44:36] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [08:45:55] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Test depool of drmrs - https://phabricator.wikimedia.org/T344968 (10Joe) A couple things: first of all, the recent issues with repooling esams were mostly due to the insufficient caching in its backend, even more than starting with a completely cold... 
[08:46:44] !log stopping puppet and pybal on lvs2013 for reboot (T344587)
[08:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:42] !log jelto@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[08:48:38] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:49:38] !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[08:51:32] (PS1) Clément Goubert: admin_ng: Remove limits on mw-debug namespace [deployment-charts] - https://gerrit.wikimedia.org/r/952314 (https://phabricator.wikimedia.org/T344814)
[08:51:46] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[08:51:58] PROBLEM - pybal on lvs2013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[08:55:03] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[08:55:42] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:55:44] PROBLEM - PyBal connections to etcd on lvs2013 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=79) https://wikitech.wikimedia.org/wiki/PyBal
[08:57:38] SRE, Infrastructure-Foundations, Traffic, netops: Test depool of drmrs - https://phabricator.wikimedia.org/T344968 (KOfori) @Joe certainly not now with all the trouble but at some point when things have stabilized, we should do it.
[09:00:07] !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[09:01:57] (CR) Clément Goubert: [C: +2] admin_ng: Remove limits on mw-debug namespace [deployment-charts] - https://gerrit.wikimedia.org/r/952314 (https://phabricator.wikimedia.org/T344814) (owner: Clément Goubert)
[09:02:03] !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[09:04:20] (Merged) jenkins-bot: admin_ng: Remove limits on mw-debug namespace [deployment-charts] - https://gerrit.wikimedia.org/r/952314 (https://phabricator.wikimedia.org/T344814) (owner: Clément Goubert)
[09:05:38] !log cgoubert@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[09:05:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM bast5004.wikimedia.org
[09:07:30] !log cgoubert@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[09:07:46] !log cgoubert@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[09:07:47] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host bast6003.wikimedia.org with OS bookworm
[09:07:47] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host bast6003.wikimedia.org
[09:07:52] SRE, Infrastructure-Foundations: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host bast6003.wikimedia.org with OS bookworm executed with errors: - bast6003 (**FAIL**) - Removed from Puppet...
[09:08:19] !log cgoubert@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[09:08:26] !log eoghan@cumin1001 START - Cookbook sre.hosts.reboot-single for host phab1004.eqiad.wmnet
[09:08:27] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[09:08:42] (ATSBackendErrorsHigh) firing: ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=eqsin%20prometheus/ops&var-cluster=upload&var-origin=swift.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[09:08:58] uh
[09:09:51] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[09:10:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM bast5004.wikimedia.org
[09:10:19] Looking
[09:10:28] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[09:11:15] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[09:11:17] eoghan: swift@codfw is struggling for some reason
[09:12:16] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:12:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor2002.codfw.wmnet
[09:14:28] !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host phab1004.eqiad.wmnet
[09:14:35] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[09:14:43] !log mvernon@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling reboot on A:thanos-fe
[09:15:04] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:15:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2002.codfw.wmnet
[09:15:16] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:16:06] Emperor: are you around?
[09:16:40] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:18:03] (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:19:17] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs2013.codfw.wmnet
[09:20:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor1002.eqiad.wmnet
[09:20:14] PROBLEM - rsyslog TLS listener on port 6514 on centrallog2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[09:21:24] RECOVERY - rsyslog TLS listener on port 6514 on centrallog2002 is OK: SSL OK - Certificate centrallog2002.codfw.wmnet valid until 2026-09-27 13:35:26 +0000 (expires in 1129 days) https://wikitech.wikimedia.org/wiki/Logs
[09:22:11] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2013.codfw.wmnet
[09:22:18] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[09:22:28] PROBLEM - pybal on lvs2013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[09:22:42] (PS1) Clément Goubert: mw-debug: Fix resourcequota on namespace [deployment-charts] - https://gerrit.wikimedia.org/r/952316 (https://phabricator.wikimedia.org/T344814)
[09:23:03] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:23:23] topranks, eoghan could you restart swift frontends in codfw?
[09:23:42] (ATSBackendErrorsHigh) firing: (2) ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[09:23:45] vgutierrez: we could yes
[09:23:51] I can try, but not sure how to do that.
[09:23:59] I was looking at some related graphs there don't see any signs of problem
[09:24:00] !log enabled puppet and pybal on lvs2013 for reboot (T344587)
[09:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1002.eqiad.wmnet
[09:24:16] RECOVERY - pybal on lvs2013 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[09:24:16] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:24:19] topranks: I'm currently fetching some logs on ATS
[09:24:21] vgutierrez: what's up?
[09:24:32] Emperor: swift@codfw returning 502s
[09:24:36] Emperor: Swift in codfw is unhappy.
[09:24:37] (CR) CI reject: [V: -1] mw-debug: Fix resourcequota on namespace [deployment-charts] - https://gerrit.wikimedia.org/r/952316 (https://phabricator.wikimedia.org/T344814) (owner: Clément Goubert)
[09:24:54] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[09:24:59] huh, yes, that is quite a marked spike
[09:25:04] Maybe. The graphs linked in the wikitech pages look ok. But not sure if that's the whole story.
[09:25:12] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:25:19] vgutierrez: do you want to check ATS before I restart codfw frontends?
[09:25:25] Emperor: I've done that already
[09:25:39] I can confirm that ATS is seeing 502s coming from swift and aren't internal ones
[09:26:27] (PS2) Clément Goubert: mw-debug: Fix resourcequota on namespace [deployment-charts] - https://gerrit.wikimedia.org/r/952316 (https://phabricator.wikimedia.org/T344814)
[09:26:29] Emperor: you're restarting those backends?
[09:26:36] frontends not backends
[09:26:50] sry yeah, the -fe nodes right?
[09:28:03] (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:28:14] RECOVERY - PyBal connections to etcd on lvs2013 is OK: OK: 79 connections established with conf2004.codfw.wmnet:4001 (min=79) https://wikitech.wikimedia.org/wiki/PyBal
[09:28:24] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
[09:29:37] sudo cookbook sre.swift.roll-restart-reboot-swift-ms-proxies --alias swift-fe-codfw --reason 'spike in 502s' restart_daemons #<-- relevant rune
[09:29:59] (PS3) Clément Goubert: mw-debug: Fix resourcequota on namespace [deployment-charts] - https://gerrit.wikimedia.org/r/952316 (https://phabricator.wikimedia.org/T344814)
[09:31:12] This is very very frustrating, swift logs nothing at all useful
[09:31:38] maybe a slight increase in lines like 'ms-fe2009 proxy-server: Client disconnected on read of'
[09:32:03] Is it OK to do emergency deployment of: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/UniversalLanguageSelector/+/951487
[09:32:55] kart_: How urgent is it? Perhaps it might be better to wait until we've sorted the Swift problem if it can hold off a few minutes?
[09:33:17] Sure. I can wait.
[09:33:42] (ATSBackendErrorsHigh) resolved: (2) ATS: elevated 5xx errors from swift.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[09:34:14] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
[09:34:23] vgutierrez: from just codfw frontends or eqiad as well?
[09:34:49] codfw is the only one impacted apparently
[09:34:52] ms-fe2010, ms-fe2011 and ms-fe2014 all had a sharp rise in tcp connections around when it started
[09:35:06] I've now restarted all the codfw frontends
[09:35:17] we get the page from ATS@eqsin cause eqsin is the busiest DC that uses codfw backends
[09:35:21] those stats seem to be returning to normal since the restart
[09:35:29] in codfw i'm restarting the lvs hosts, but this should not affect
[09:35:43] 500s down significantly since the restart too
[09:35:54] (PS4) Clément Goubert: mw-debug: Fix resourcequota on namespace [deployment-charts] - https://gerrit.wikimedia.org/r/952316 (https://phabricator.wikimedia.org/T344814)
[09:36:33] yeah.. atslog-backend RespStatus:502 on cp5026 shows that 502s are gone
[09:36:35] vgutierrez: and _definitely_ from a swift fe server, nothing else in the stack ?
[09:37:01] Emperor: swift fe or whatever software is terminating the TLS connection in front of it
[09:37:51] nginx in the case of swift
[09:38:24] a bad gateway error could be explained by swift-fe not being able to answer an incoming request from nginx
[09:38:34] (CR) Clément Goubert: [C: +2] mw-debug: Fix resourcequota on namespace [deployment-charts] - https://gerrit.wikimedia.org/r/952316 (https://phabricator.wikimedia.org/T344814) (owner: Clément Goubert)
[09:38:39] PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[09:39:11] as suggested in the past we should move from nginx to envoy.. as a side effect we would get some interesting metrics on how upstream (fe daemon) behaves
[09:39:49] vgutierrez: how hard is that move going to be?
[09:40:00] [nginx logs are uninteresting as ever]
[09:40:06] (CR) Filippo Giunchedi: [C: +2] sre: move KubernetesAPINotScrapable to k8s-specific alerts [alerts] - https://gerrit.wikimedia.org/r/952301 (https://phabricator.wikimedia.org/T343529) (owner: Filippo Giunchedi)
[09:40:31] I know you had a phab task about it for a while, which was stuck on all the frontends being on bullseye, but they are all up to date now
[09:40:53] that's T317616
[09:40:54] T317616: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616
[09:40:57] (Merged) jenkins-bot: mw-debug: Fix resourcequota on namespace [deployment-charts] - https://gerrit.wikimedia.org/r/952316 (https://phabricator.wikimedia.org/T344814) (owner: Clément Goubert)
[09:41:20] kart_: I think things have stabilised enough that you should be good now. Many thanks for your patience!
[09:41:40] !log cgoubert@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[09:41:42] maybe I should ask that question on the ticket
[09:41:48] !log cgoubert@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[09:42:08] kart_: Give me a minute to finish up something on mw-on-k8s
[09:42:12] eoghan: Thanks!
[09:42:19] claime: sure. Please ping me :)
[09:42:21] !log cgoubert@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[09:42:28] !log cgoubert@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[09:42:32] ack
[09:42:37] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[09:42:42] SRE, SRE-swift-storage, Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (MatthewVernon) @Vgutierrez how hard is moving swift frontends to using envoy in place of nginx likely to be? If it could give us better visibility of what is going on in these occasional s...
[09:43:47] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[09:43:54] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[09:44:01] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[09:44:06] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[09:44:32] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[09:44:55] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[09:46:25] (03PS1) 10Clément Goubert: Revert "mediawiki: Remove limits for tls-proxy container" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952137 [09:46:59] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:47:25] (03CR) 10Clément Goubert: [C: 03+2] Revert "mediawiki: Remove limits for tls-proxy container" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952137 (owner: 10Clément Goubert) [09:47:35] !log disabling puppet and pybal on lvs2012 for reboot (T344587) [09:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:22] (03Merged) 10jenkins-bot: Revert "mediawiki: Remove limits for tls-proxy container" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952137 (owner: 10Clément Goubert) [09:48:43] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:49:27] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:50:55] PROBLEM - PyBal backends health check on lvs2012 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [09:50:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM bast6003.wikimedia.org [09:50:57] PROBLEM - pybal on lvs2012 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [09:51:02] (03PS1) 10Clément Goubert: mediawiki: Fix revert [deployment-charts] - 10https://gerrit.wikimedia.org/r/952319 (https://phabricator.wikimedia.org/T344814) [09:51:27] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:52:09] PROBLEM - BGP status on 
cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:52:11] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Fix revert [deployment-charts] - 10https://gerrit.wikimedia.org/r/952319 (https://phabricator.wikimedia.org/T344814) (owner: 10Clément Goubert) [09:52:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:53:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:53:13] (03Merged) 10jenkins-bot: mediawiki: Fix revert [deployment-charts] - 10https://gerrit.wikimedia.org/r/952319 (https://phabricator.wikimedia.org/T344814) (owner: 10Clément Goubert) [09:54:06] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [09:54:11] (03CR) 10Jbond: "lgtm just need to fix the rspec test" [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [09:54:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM bast6003.wikimedia.org [09:55:00] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [09:55:11] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [09:55:13] PROBLEM - PyBal connections to etcd on lvs2012 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=6) https://wikitech.wikimedia.org/wiki/PyBal [09:55:22] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [09:55:37] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1002 is 
OK: SSL OK - Certificate centrallog1002.eqiad.wmnet valid until 2028-01-24 19:33:10 +0000 (expires in 1613 days) https://wikitech.wikimedia.org/wiki/Logs [09:56:06] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [09:56:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [09:56:38] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [09:57:11] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [09:57:23] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [09:57:31] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [09:57:35] kart_: you're good, sorry for the delay [09:57:43] claime: cool. [09:58:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:00:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [extensions/UniversalLanguageSelector] (wmf/1.41.0-wmf.23) - 10https://gerrit.wikimedia.org/r/951487 (https://phabricator.wikimedia.org/T344635) (owner: 10Abijeet Patro) [10:03:23] "10:00:48 Retrying (Retry(total=9, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))': /r/changes/951487/detail?o=COMMIT_FOOTERS&o=ALL_REVISIONS" --> That's new. [10:03:42] While backporting 951487. 
[10:03:48] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "new bastion - jmm@cumin2002" [10:04:49] 10sre-alert-triage, 10Release-Engineering-Team: Alert triage: overdue critical alert - https://phabricator.wikimedia.org/T342755 (10fgiunchedi) >>! In T342755#9118330, @thcipriani wrote: >>>! In T342755#9115945, @fgiunchedi wrote: >>>>! In T342755#9114368, @thcipriani wrote: >>> Hrm. We get an email from the s... [10:06:04] (03PS1) 10Muehlenhoff: new bastions in ulsfo/eqsin/drmrs [puppet] - 10https://gerrit.wikimedia.org/r/952320 (https://phabricator.wikimedia.org/T343515) [10:06:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "new bastion - jmm@cumin2002" [10:06:44] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs2012.codfw.wmnet [10:08:54] kart_: gerrit connection issue? [10:09:29] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:09:33] !log mvernon@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling reboot on A:thanos-fe [10:09:38] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2012.codfw.wmnet [10:10:04] !log enabled puppet and pybal on lvs2012 (T344587) [10:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:21] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:10:25] RECOVERY - pybal on lvs2012 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [10:10:27] RECOVERY - PyBal backends health check on lvs2012 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:10:45] 10SRE, 10SRE-swift-storage, 10Traffic: 
Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (10Vgutierrez) As a side effect of moving to envoy we would be getting https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1 data for swift. As stated in the task description the... [10:11:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [10:11:29] RECOVERY - PyBal connections to etcd on lvs2012 is OK: OK: 6 connections established with conf2004.codfw.wmnet:4001 (min=6) https://wikitech.wikimedia.org/wiki/PyBal [10:12:35] (03CR) 10Muehlenhoff: [C: 03+2] new bastions in ulsfo/eqsin/drmrs [puppet] - 10https://gerrit.wikimedia.org/r/952320 (https://phabricator.wikimedia.org/T343515) (owner: 10Muehlenhoff) [10:13:57] (03Merged) 10jenkins-bot: ext.uls.interface.js: Inline isNamed() method [extensions/UniversalLanguageSelector] (wmf/1.41.0-wmf.23) - 10https://gerrit.wikimedia.org/r/951487 (https://phabricator.wikimedia.org/T344635) (owner: 10Abijeet Patro) [10:14:22] !log kartik@deploy1002 Started scap: Backport for [[gerrit:951487|ext.uls.interface.js: Inline isNamed() method (T344635)]] [10:14:27] T344635: Language selector broken: Uncaught TypeError: mw.uls.isNamed is not a function - https://phabricator.wikimedia.org/T344635 [10:14:53] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:15:45] !log kartik@deploy1002 abi and kartik: Backport for [[gerrit:951487|ext.uls.interface.js: Inline isNamed() method (T344635)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [10:18:55] <_joe_> wait [10:19:05] <_joe_> are we rebooting LVS servers while deploying? 
[10:20:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:20:38] _joe_: yes, that should be no more an issue [10:20:43] afaik [10:22:54] !log kartik@deploy1002 abi and kartik: Continuing with sync [10:24:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:24:39] (03PS1) 10Jbond: puppetdb-microservice: check the query object for type [puppet] - 10https://gerrit.wikimedia.org/r/952324 [10:24:54] <_joe_> fabfur: did anyone check it isn't? [10:25:11] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [10:25:14] (03CR) 10Jbond: [C: 03+2] puppetdb-microservice: check the query object for type [puppet] - 10https://gerrit.wikimedia.org/r/952324 (owner: 10Jbond) [10:25:35] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppetdb-microservice: check the query object for type [puppet] - 10https://gerrit.wikimedia.org/r/952324 (owner: 10Jbond) [10:26:25] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1002 is OK: SSL OK - Certificate centrallog1002.eqiad.wmnet valid until 2028-01-24 19:33:10 +0000 (expires in 1613 days) https://wikitech.wikimedia.org/wiki/Logs [10:26:31] (03PS1) 10Ayounsi: Add gNMI based telemetry collection using gNMIc [puppet] - 10https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) [10:26:56] (03CR) 10CI reject: [V: 04-1] Add gNMI based telemetry collection using gNMIc [puppet] - 10https://gerrit.wikimedia.org/r/952325 
(https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [10:28:29] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:951487|ext.uls.interface.js: Inline isNamed() method (T344635)]] (duration: 14m 06s) [10:28:34] T344635: Language selector broken: Uncaught TypeError: mw.uls.isNamed is not a function - https://phabricator.wikimedia.org/T344635 [10:29:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:29:34] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:30:44] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [10:33:28] 10SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10Physikerwelt) [10:33:53] (JobUnavailable) firing: (3) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:33:54] (SystemdUnitFailed) firing: (3) ipmiseld.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:34:09] (03PS1) 10Peter Fischer: Disable search result deduplication. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952346 (https://phabricator.wikimedia.org/T341227) [10:34:34] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:35:04] _joe_: yes, that problem with lvs reboots and deployment should be definitely fixed now [10:35:12] PROBLEM - rsyslog TLS listener on port 6514 on centrallog2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [10:35:23] 10SRE, 10API Platform, 10Anti-Harassment, 10Content-Transform-Team, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10Physikerwelt) [10:35:26] RECOVERY - rsyslog TLS listener on port 6514 on centrallog2002 is OK: SSL OK - Certificate centrallog2002.codfw.wmnet valid until 2026-09-27 13:35:26 +0000 (expires in 1129 days) https://wikitech.wikimedia.org/wiki/Logs [10:35:33] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:38:26] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:38:27] !log disabled puppet and pybal on lvs2011 for reboot (T344587) [10:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:18] PROBLEM - PyBal backends health check on lvs2011 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection 
refused) https://wikitech.wikimedia.org/wiki/PyBal [10:40:26] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:40:30] PROBLEM - PyBal connections to etcd on lvs2011 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [10:40:33] (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:40:42] PROBLEM - pybal on lvs2011 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [10:41:56] (03Abandoned) 10Btullis: Upload the spark3-assemly file to HDFS on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/901670 (https://phabricator.wikimedia.org/T295072) (owner: 10Btullis) [10:45:14] PROBLEM - Check systemd state on ml-serve1003 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:46:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [10:47:21] (03PS1) 10Clément Goubert: mediawiki: Add missing controls for php-fpm [deployment-charts] - 10https://gerrit.wikimedia.org/r/952349 (https://phabricator.wikimedia.org/T341320) [10:47:40] RECOVERY - Check systemd state on ml-serve1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:40] (03PS2) 10Clément Goubert: mediawiki: Add missing controls for php-fpm [deployment-charts] - 10https://gerrit.wikimedia.org/r/952349 
(https://phabricator.wikimedia.org/T341320) [10:48:38] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [10:50:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:51:38] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [10:52:31] (03CR) 10Effie Mouzeli: [C: 03+1] mediawiki: Add missing controls for php-fpm [deployment-charts] - 10https://gerrit.wikimedia.org/r/952349 (https://phabricator.wikimedia.org/T341320) (owner: 10Clément Goubert) [10:55:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:55:17] (03PS1) 10Urbanecm: Growth: Welcome survey user research: Use a generic question [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952351 (https://phabricator.wikimedia.org/T342353) [10:55:29] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mediawiki: Add missing controls for php-fpm [deployment-charts] - 10https://gerrit.wikimedia.org/r/952349 (https://phabricator.wikimedia.org/T341320) (owner: 10Clément Goubert) [10:55:39] (03CR) 10Urbanecm: [C: 04-2] "not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952351 (https://phabricator.wikimedia.org/T342353) (owner: 10Urbanecm) [10:56:23] (03CR) 10Clément Goubert: [C: 03+2] 
mediawiki: Add missing controls for php-fpm [deployment-charts] - 10https://gerrit.wikimedia.org/r/952349 (https://phabricator.wikimedia.org/T341320) (owner: 10Clément Goubert) [10:56:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [10:56:41] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host lvs2011.codfw.wmnet [10:57:09] (03CR) 10Ayounsi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [10:57:33] (03CR) 10CI reject: [V: 04-1] Add gNMI based telemetry collection using gNMIc [puppet] - 10https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [10:57:39] (03Merged) 10jenkins-bot: mediawiki: Add missing controls for php-fpm [deployment-charts] - 10https://gerrit.wikimedia.org/r/952349 (https://phabricator.wikimedia.org/T341320) (owner: 10Clément Goubert) [10:59:35] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs2011.codfw.wmnet [10:59:44] !log Deploying mediawiki: Add missing controls for php-fpm - T341320 [10:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:49] T341320: Wikimedia\RemexHtml\Tokenizer\TokenizerError: Wikimedia\RemexHtml\Tokenizer\Tokenizer: pcre.backtrack_limit exhausted - https://phabricator.wikimedia.org/T341320 [11:00:01] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [11:00:20] RECOVERY - PyBal backends health check on lvs2011 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:00:29] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [11:00:30] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:00:31] !log enabled puppet and pybal on lvs2011 (T344587) [11:00:37] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [11:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:48] RECOVERY - pybal on lvs2011 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [11:01:09] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [11:01:34] RECOVERY - PyBal connections to etcd on lvs2011 is OK: OK: 12 connections established with conf2004.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [11:02:36] (03PS3) 10Hnowlan: helmfile: add entries and namespace for media-analytics service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951544 (https://phabricator.wikimedia.org/T336380) [11:05:22] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [11:08:54] (SystemdUnitFailed) firing: (3) ipmiseld.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:09:32] !log cgoubert@deploy1002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [11:09:32] !log cgoubert@deploy1002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync [11:09:47] !log cgoubert@deploy1002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [11:09:47] !log cgoubert@deploy1002 helmfile [codfw] [canary] DONE 
helmfile.d/services/mw-jobrunner : sync [11:09:52] !log cgoubert@deploy1002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [11:09:52] !log cgoubert@deploy1002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [11:10:04] !log cgoubert@deploy1002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [11:10:04] !log cgoubert@deploy1002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [11:10:05] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [11:10:55] !log eoghan@cumin2002 START - Cookbook sre.hosts.reboot-single for host phab2002.codfw.wmnet [11:11:08] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [11:11:09] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [11:12:29] 10SRE, 10Data-Platform-SRE, 10Discovery-Search: Unable to use kafka-topic.sh - Topic authorization failed - https://phabricator.wikimedia.org/T344989 (10pfischer) [11:12:50] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [11:12:51] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [11:13:10] (03CR) 10Physikerwelt: [C: 03+1] "> Noting also that this release bumps mathoid to node16 (see https://gerrit.wikimedia.org/r/c/mediawiki/services/mathoid/+/866666)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/919375 (owner: 10PipelineBot) [11:13:30] (03CR) 10Physikerwelt: [C: 04-1] mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/906694 (owner: 10PipelineBot) [11:13:34] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [11:13:35] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [11:14:03] (03CR) 10Physikerwelt: [C: 04-1] mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/890357 (owner: 10PipelineBot) [11:14:59] !log 
jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [11:15:07] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [11:15:08] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [11:15:36] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [11:15:37] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [11:16:23] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [11:16:25] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [11:16:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:16:52] !log eoghan@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host phab2002.codfw.wmnet [11:17:05] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [11:17:06] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [11:17:28] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [11:21:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:24:18] !log eoghan@cumin2002 START - Cookbook sre.hosts.reboot-single for host phab-test1001.eqiad.wmnet [11:26:47] (03PS4) 10Hnowlan: helmfile: add entries and namespace for media-analytics service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951544 (https://phabricator.wikimedia.org/T336380) [11:28:04] (03PS5) 10Hnowlan: 
helmfile: add entries and namespace for media-analytics service [deployment-charts] - https://gerrit.wikimedia.org/r/951544 (https://phabricator.wikimedia.org/T336380) [11:29:10] !log eoghan@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host phab-test1001.eqiad.wmnet [11:29:39] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bookworm [11:37:12] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [11:37:46] (CR) Jbond: "LGTM a few suggestions and fix to the ci issue" [puppet] - https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) (owner: Ayounsi) [11:38:10] !log jayme@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [11:39:34] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bookworm [11:45:08] (CR) Klausman: "Reviewers note: please take extra care to review the computations, as I am still not 100% sure I got the SLO budget calculcation right." [grafana-grizzly] - https://gerrit.wikimedia.org/r/952226 (owner: Klausman) [11:49:05] (CR) Klausman: "This change is ready for review." (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/951461 (https://phabricator.wikimedia.org/T334182) (owner: Klausman) [11:51:37] (CR) Klausman: "This change is ready for review."
[puppet] - https://gerrit.wikimedia.org/r/951460 (https://phabricator.wikimedia.org/T334182) (owner: Klausman) [11:54:09] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [11:56:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [11:57:18] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: host reimage [11:58:29] SRE, MW-on-K8s, Observability-Logging, serviceops: Apache logs get split across packets in MW-on-K8s - https://phabricator.wikimedia.org/T344991 (kamila) [12:00:24] SRE, MW-on-K8s, Observability-Logging, serviceops: Keep calculating latencies for MediaWiki requests in the WikiKube environment - https://phabricator.wikimedia.org/T276095 (kamila) The errors are caused by T344991. Thus, the metrics produced by Benthos are not counting those requests.
[12:01:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [12:03:39] (PS4) Klausman: LiftWing: add latency/availability SLO dashboards [grafana-grizzly] - https://gerrit.wikimedia.org/r/952226 [12:06:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [12:10:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org [12:11:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [12:12:08] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster1006.eqiad.wmnet [12:14:36] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1002.eqiad.wmnet with OS bookworm [12:16:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [12:16:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org [12:17:17] (PS2) Ayounsi: Add gNMI based telemetry collection using gNMIc [puppet] - https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) [12:17:22] (CR) Ayounsi: "Awesome, thanks a lot!"
[puppet] - https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) (owner: Ayounsi) [12:18:57] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster1006.eqiad.wmnet [12:19:35] !log disable puppet fleet wide for reboots [12:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:48] (PS1) Muehlenhoff: Update bastions for ssh-client-config and tunnelencabulator [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/952388 [12:20:50] !log jbond@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetmaster2001.codfw.wmnet [12:20:51] !log jbond@cumin1001 START - Cookbook sre.hosts.reboot-single for host puppetmaster1001.eqiad.wmnet [12:20:59] (PS1) Muehlenhoff: Bump changelog [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/952389 [12:21:45] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-aux - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:24:57] (CR) Muehlenhoff: [V: +2 C: +2] Update bastions for ssh-client-config and tunnelencabulator [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/952388 (owner: Muehlenhoff) [12:25:08] (CR) Muehlenhoff: [V: +2 C: +2] Bump changelog [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/952389 (owner: Muehlenhoff) [12:25:26] RECOVERY - puppet last run on wdqs1005 is OK: OK: Puppet is currently disabled (reboot puppet infrastructre), not alerting.
Last run 2 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:26:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [12:26:45] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-aux - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:27:14] !log jbond@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host puppetmaster2001.codfw.wmnet [12:27:53] (03Abandoned) 10JHathaway: DO NOT MERGE: Remove hostname from ssh known_hosts aliases [puppet] - 10https://gerrit.wikimedia.org/r/941543 (owner: 10JHathaway) [12:29:32] PROBLEM - Host puppetmaster1003 is DOWN: PING CRITICAL - Packet loss = 100% [12:29:48] PROBLEM - Host puppetmaster1004 is DOWN: PING CRITICAL - Packet loss = 100% [12:30:12] this is me, please ignore ^^ [12:30:16] RECOVERY - Host puppetmaster1003 is UP: PING OK - Packet loss = 0%, RTA = 3.45 ms [12:30:20] PROBLEM - Host puppetmaster2003 is DOWN: PING CRITICAL - Packet loss = 100% [12:30:32] PROBLEM - Host puppetmaster2004 is DOWN: PING CRITICAL - Packet loss = 100% [12:30:56] RECOVERY - Host puppetmaster1004 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms [12:30:56] RECOVERY - Host puppetmaster2004 is UP: PING OK - Packet loss = 0%, RTA = 31.70 ms [12:31:16] RECOVERY - Host puppetmaster2003 is UP: PING OK - Packet loss = 0%, RTA = 31.68 ms [12:31:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [12:32:55] !log imported wmf-laptop 0.5.8 to apt.wikimedia.org [12:32:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:19] 10SRE, 10MW-on-K8s, 10Observability-Logging, 10serviceops: Apache logs get split across packets in MW-on-K8s - https://phabricator.wikimedia.org/T344991 (10kamila) [12:37:21] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetmaster1001.eqiad.wmnet [12:41:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [12:46:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [12:49:38] (03PS3) 10Ayounsi: Add gNMI based telemetry collection using gNMIc [puppet] - 10https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) [12:53:17] (03PS1) 10Ayounsi: Add mock secret for the gnmi telemetry user [labs/private] - 10https://gerrit.wikimedia.org/r/952398 (https://phabricator.wikimedia.org/T326322) [12:54:18] (03CR) 10Jbond: [C: 03+1] "lgtm" [labs/private] - 10https://gerrit.wikimedia.org/r/952398 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [12:55:23] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Add mock secret for the gnmi telemetry user [labs/private] - 10https://gerrit.wikimedia.org/r/952398 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [12:56:08] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active - NTT, AS2914/IPv6: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:56:42] (03PS4) 10Ayounsi: Add gNMI based telemetry collection using gNMIc [puppet] - 10https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) [12:59:02] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) 
(owner: 10Ayounsi) [13:00:24] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:10:48] (03PS1) 10Jbond: kernel_report: small script to generate reboots task [puppet] - 10https://gerrit.wikimedia.org/r/952401 [13:11:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:13:11] (03CR) 10CI reject: [V: 04-1] kernel_report: small script to generate reboots task [puppet] - 10https://gerrit.wikimedia.org/r/952401 (owner: 10Jbond) [13:13:33] (03CR) 10Jbond: WIP: add gNMI (+cert) check for network devices (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/948553 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [13:16:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:19:20] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the netbox to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (10jbond) [13:21:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10Jclark-ctr) cloudelastic1007 A2 U26 cloudelastic1008. B2. U25 cloudelastic1009. C2. U27 cloudelastic1010. D2. 
U36 [13:21:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10Jclark-ctr) a:03Jclark-ctr [13:22:28] (03CR) 10FNegri: [C: 03+1] "This can be merged I think? It cleans up the current group memberships, and the only side effect is some extra permissions for taavi and s" [puppet] - 10https://gerrit.wikimedia.org/r/923684 (owner: 10Jbond) [13:23:52] (03PS1) 10Slyngshede: Allow Unix shell account to be specified. [software/bitu] - 10https://gerrit.wikimedia.org/r/952402 [13:24:56] (03CR) 10Muehlenhoff: idp: add datacenter-ops to puppetboard (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/951903 (https://phabricator.wikimedia.org/T341581) (owner: 10Jbond) [13:26:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [13:28:06] PROBLEM - puppet last run on wdqs1005 is CRITICAL: CRITICAL: Puppet last ran 2 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:29:45] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:31:43] (03CR) 10Jbond: Add gNMI based telemetry collection using gNMIc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [13:32:26] (03CR) 10Jbond: "lgtm apart from the wrong type" [puppet] - 10https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [13:32:46] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [13:33:39] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[13:37:02] SRE, ops-codfw, serviceops: Move codfw thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343996 (Jhancock.wm) [13:37:25] SRE, ops-codfw, serviceops: Decommission thumbor200[34] - https://phabricator.wikimedia.org/T344597 (Jhancock.wm) Open→Resolved [13:38:02] (PS5) Ayounsi: Add gNMI based telemetry collection using gNMIc [puppet] - https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) [13:38:44] (CR) Ayounsi: Add gNMI based telemetry collection using gNMIc (1 comment) [puppet] - https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) (owner: Ayounsi) [13:38:52] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) (owner: Ayounsi) [13:40:27] !log installing w3m security updates [13:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:28] (PS1) FNegri: cluster::cloud_management allow access to wmcs [puppet] - https://gerrit.wikimedia.org/r/952448 (https://phabricator.wikimedia.org/T325067) [13:44:13] SRE-tools, Cloud-VPS, Infrastructure-Foundations, Goal, and 2 others: cloudcumin: decide sudoers rules for users without global root - https://phabricator.wikimedia.org/T325067 (fnegri) I created a patch to propose a specific solution for this task: https://gerrit.wikimedia.org/r/952448 Before m...
[13:45:38] (03CR) 10Jbond: [C: 03+2] admin: add wmcs-roots to wmcs-admins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/923684 (owner: 10Jbond) [13:47:24] (03PS1) 10Ssingh: cp/esams: unify cp hieradata overrides [puppet] - 10https://gerrit.wikimedia.org/r/952449 (https://phabricator.wikimedia.org/T344174) [13:47:46] (03CR) 10Jbond: [C: 03+2] admin: deprecate labtest-roots group [puppet] - 10https://gerrit.wikimedia.org/r/951469 (https://phabricator.wikimedia.org/T337848) (owner: 10Jbond) [13:47:56] (03PS5) 10Jbond: admin: deprecate labtest-roots group [puppet] - 10https://gerrit.wikimedia.org/r/951469 (https://phabricator.wikimedia.org/T337848) [13:51:09] 10SRE, 10MW-on-K8s, 10Observability-Logging, 10serviceops: Apache logs get split across packets in MW-on-K8s - https://phabricator.wikimedia.org/T344991 (10kamila) a:03kamila [13:57:24] 10SRE, 10Math, 10RESTBase Sunsetting, 10Traffic: Determin the cause of x8 increase in requests to math endpoints between july 6 and August 3 - https://phabricator.wikimedia.org/T344329 (10Physikerwelt) p:05Triage→03Low I can't explain this. However, I think it is less critical. Previously, we had thr... [13:58:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10] cloudnet100[7-8] - https://phabricator.wikimedia.org/T342455 (10Jclark-ctr) cloudcontrol1008 D5 U38 cloudcontrol1009. E4 U40 cloudcontrol1010. F4. U40 cloudnet1007. E4. U39 cloudnet1008... 
[13:59:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10] cloudnet100[7-8] - https://phabricator.wikimedia.org/T342455 (10Jclark-ctr) a:03Jclark-ctr [14:01:08] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/952449/43026/" [puppet] - 10https://gerrit.wikimedia.org/r/952449 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [14:08:53] (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:11:58] 10SRE, 10Math, 10RESTBase Sunsetting, 10Traffic: Determine the cause of x8 increase in requests to math endpoints between july 6 and August 3 2023 - https://phabricator.wikimedia.org/T344329 (10Aklapper) [14:13:41] (03CR) 10Vgutierrez: [C: 03+1] cp/esams: unify cp hieradata overrides [puppet] - 10https://gerrit.wikimedia.org/r/952449 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [14:14:28] (03CR) 10Jbond: [C: 04-2] "see inline unfortunately validate_cmd won't work in this case. 
i have done a -2 as im not sure there is a way to get this working but if " [puppet] - 10https://gerrit.wikimedia.org/r/937139 (owner: 10Ssingh) [14:15:23] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [14:18:22] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:44] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:19:07] (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:20:26] PROBLEM - Host kubernetes2009 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:32] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:21:46] ^ anyone working on this? 
[14:24:27] (03CR) 10Ssingh: dnsrecursor: use validate_cmd for pdns-recursor config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937139 (owner: 10Ssingh) [14:24:34] (KubernetesCalicoDown) firing: kubernetes2009.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2009.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:24:50] hmm ok [14:25:04] the last entry is 2023-20-21 [14:25:06] er, 02 [14:27:26] RECOVERY - Host kubernetes2009 is UP: PING OK - Packet loss = 0%, RTA = 31.69 ms [14:27:28] sukhe: looking at kubernetes2009 [14:27:35] thank you [14:27:42] can't see anything in the getsel logs [14:28:56] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:29:10] me neither [14:29:34] (KubernetesCalicoDown) resolved: kubernetes2009.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2009.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:30:01] console hangs [14:30:05] I'll powercycle [14:30:32] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:31:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [14:31:40] PROBLEM - Host kubernetes2009 is DOWN: PING CRITICAL - Packet loss = 100% [14:31:54] !log powercycled kubernetes2009 [14:31:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:25] (03PS1) 10Jbond: dbproxy: change ownership to wmcs [puppet] - 10https://gerrit.wikimedia.org/r/952455 (https://phabricator.wikimedia.org/T337848) [14:33:34] RECOVERY - Host kubernetes2009 is UP: PING OK - Packet loss = 0%, RTA = 31.65 ms [14:33:58] (03CR) 10Hashar: [C: 03+1] profile::ci::package_builder::extra_packages: Remove Stretch support [puppet] - 10https://gerrit.wikimedia.org/r/952302 (owner: 10Muehlenhoff) [14:34:32] (03CR) 10Hashar: [C: 03+1] "And I have confirmed modules/package_builder/manifests/environments.pp no more refers to stretch:" [puppet] - 10https://gerrit.wikimedia.org/r/952302 (owner: 10Muehlenhoff) [14:36:01] claime: definitely worth looking up here what went wrong, can't see anything obvious [14:36:10] yeah [14:36:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [14:36:55] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2048.codfw.wmnet with OS bullseye [14:37:03] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2048.codfw.wmnet with OS bullseye [14:37:04] Aug 25 14:30:27 kubernetes2009 systemd[1]: Closed LVM2 poll daemon socket. 
[14:37:06] ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ [14:37:08] informative logs [14:37:44] (03CR) 10Marostegui: [C: 03+1] dbproxy: change ownership to wmcs [puppet] - 10https://gerrit.wikimedia.org/r/952455 (https://phabricator.wikimedia.org/T337848) (owner: 10Jbond) [14:39:54] (03PS1) 10Muehlenhoff: gerrit : Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/952457 [14:41:16] (03PS1) 10Muehlenhoff: bastion: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/952459 [14:41:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [14:44:34] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:41] (03PS12) 10Cathal Mooney: Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) [14:45:26] (03PS1) 10Muehlenhoff: aphlict : Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/952461 [14:46:19] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952457 (owner: 10Muehlenhoff) [14:46:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [14:46:40] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952461 (owner: 10Muehlenhoff) [14:47:09] (03PS2) 10FNegri: cluster::cloud_management allow access to wmcs [puppet] - 10https://gerrit.wikimedia.org/r/952448 (https://phabricator.wikimedia.org/T325067) [14:50:05] (03CR) 10Ssingh: [C: 03+2] cp/esams: unify cp hieradata 
overrides [puppet] - 10https://gerrit.wikimedia.org/r/952449 (https://phabricator.wikimedia.org/T344174) (owner: 10Ssingh) [14:51:50] (03CR) 10Cathal Mooney: Modify install and apt server config to support Juniper ZTP via HTTP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [14:52:16] !log force run agent on A:cp-esams [14:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:41] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2048.codfw.wmnet with reason: host reimage [15:01:42] (03CR) 10Jbond: [C: 04-2] "tried to add a bit more context but feel free to ping on irc" [puppet] - 10https://gerrit.wikimedia.org/r/937139 (owner: 10Ssingh) [15:01:44] (03CR) 10Btullis: "I wonder if it's worth adding Data Engineering here too, or Data Platform SRE (which doesn't yet exist as a contact). I know that the owne" [puppet] - 10https://gerrit.wikimedia.org/r/952455 (https://phabricator.wikimedia.org/T337848) (owner: 10Jbond) [15:02:15] (03CR) 10Btullis: "Ref: https://phabricator.wikimedia.org/T342578" [puppet] - 10https://gerrit.wikimedia.org/r/952455 (https://phabricator.wikimedia.org/T337848) (owner: 10Jbond) [15:02:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2048.codfw.wmnet with reason: host reimage [15:08:54] (SystemdUnitFailed) firing: (2) ipmiseld.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:13:33] (03CR) 10Jbond: dbproxy: change ownership to wmcs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952455 (https://phabricator.wikimedia.org/T337848) (owner: 10Jbond) [15:14:15] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/942682 
(https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [15:17:16] (03PS2) 10Gergő Tisza: multi-dc: Fix central autologin URL pattern [puppet] - 10https://gerrit.wikimedia.org/r/952045 [15:17:41] (03CR) 10Gergő Tisza: multi-dc: Fix central autologin URL pattern (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952045 (owner: 10Gergő Tisza) [15:17:58] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [15:18:09] (03PS1) 10Jbond: admin: update ssh key for RMaung [puppet] - 10https://gerrit.wikimedia.org/r/952467 (https://phabricator.wikimedia.org/T330335) [15:18:23] (03PS6) 10Ayounsi: Add gNMI based telemetry collection using gNMIc [puppet] - 10https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) [15:18:36] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [15:19:08] (03CR) 10Jbond: [C: 03+2] admin: update ssh key for RMaung [puppet] - 10https://gerrit.wikimedia.org/r/952467 (https://phabricator.wikimedia.org/T330335) (owner: 10Jbond) [15:19:35] (03CR) 10CI reject: [V: 04-1] multi-dc: Fix central autologin URL pattern [puppet] - 10https://gerrit.wikimedia.org/r/952045 (owner: 10Gergő Tisza) [15:23:51] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and Kerberos identity for RMaung - https://phabricator.wikimedia.org/T330335 (10jbond) 05Open→03Resolved a:05JMeybohm→03None The new key has been merged and should be rolled out in 30 minutes [15:24:06] (03PS7) 10Ayounsi: Add gNMI based telemetry collection using gNMIc [puppet] - 10https://gerrit.wikimedia.org/r/952325 (https://phabricator.wikimedia.org/T326322) [15:24:20] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and Kerberos identity for RMaung - 
https://phabricator.wikimedia.org/T330335 (10jbond) a:03JMeybohm [15:26:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [15:31:15] 10SRE, 10SRE-Access-Requests: MediaWiki deployment shell access request for Kizule (aka Zoranzoki21) - https://phabricator.wikimedia.org/T344887 (10jbond) @Kizule thanks for the request for production access please read and follow the process [[ https://wikitech.wikimedia.org/wiki/SRE/Production_access | highl... [15:31:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [15:51:08] (03CR) 10Nskaggs: dbproxy: change ownership to wmcs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952455 (https://phabricator.wikimedia.org/T337848) (owner: 10Jbond) [15:53:50] (03PS3) 10Gergő Tisza: multi-dc: Fix central autologin URL pattern [puppet] - 10https://gerrit.wikimedia.org/r/952045 [15:54:41] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:04] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be100[34] - https://phabricator.wikimedia.org/T342675 (10Jclark-ctr) @MatthewVernon just want to verify names. we are listing two servers in racking task but only had one ordered? 
[15:57:09] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be100[34] - https://phabricator.wikimedia.org/T342675 (10Jclark-ctr) [15:58:01] (03CR) 10Cathal Mooney: [C: 03+2] Modify install and apt server config to support Juniper ZTP via HTTP [puppet] - 10https://gerrit.wikimedia.org/r/942682 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [16:07:17] (03PS1) 10Cathal Mooney: Fix typo in reference to new ztp_juniper class for aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/952473 (https://phabricator.wikimedia.org/T336485) [16:08:36] (03CR) 10Jbond: dbproxy: change ownership to wmcs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952455 (https://phabricator.wikimedia.org/T337848) (owner: 10Jbond) [16:09:21] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/952473 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [16:09:51] (03CR) 10Cathal Mooney: [C: 03+2] Fix typo in reference to new ztp_juniper class for aptrepo [puppet] - 10https://gerrit.wikimedia.org/r/952473 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [16:10:13] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:12:22] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding new host moss-be2003 to CODFW - jhancock@cumin2002" [16:14:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding new host moss-be2003 to CODFW - jhancock@cumin2002" [16:14:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:14:59] (SystemdUnitFailed) firing: (3) ipmiseld.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:21:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [16:22:44] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [16:23:29] (03PS1) 10Btullis: Add .bash_aliases file for btullis [puppet] - 10https://gerrit.wikimedia.org/r/952475 [16:23:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host moss-be2003.mgmt.codfw.wmnet with reboot policy FORCED [16:27:37] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:27:38] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2048.codfw.wmnet with OS bullseye [16:27:44] (ThanosQueryInstantLatencyHigh) firing: (2) Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [16:27:45] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2048.codfw.wmnet with OS bullseye completed: - kubernetes2048 (**WARN*... [16:31:15] (03CR) 10Ebernhardson: [C: 03+1] Disable search result deduplication. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/952346 (https://phabricator.wikimedia.org/T341227) (owner: 10Peter Fischer) [16:31:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [16:31:44] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (10Papaul) [16:33:32] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10Jhancock.wm) [16:33:42] (03PS1) 10Btullis: Build production-images based on spark 3.3.3 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952476 (https://phabricator.wikimedia.org/T344910) [16:34:39] (03PS2) 10Btullis: Build production-images based on spark 3.3.3 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952476 (https://phabricator.wikimedia.org/T344910) [16:37:44] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [16:38:19] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host moss-be2003.mgmt.codfw.wmnet with reboot policy FORCED [16:38:47] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host moss-be2003.mgmt.codfw.wmnet with reboot policy FORCED [16:39:35] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host moss-be2003.mgmt.codfw.wmnet with reboot policy FORCED [16:41:10] (ThanosQueryHttpRequestQueryRangeErrorRateHigh) firing: Thanos Query is failing to handle requests. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh [16:41:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [16:41:35] !log cmooney@cumin1001 START - Cookbook sre.network.provision for device ssw1-a1-codfw.mgmt.codfw.wmnet [16:41:36] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [16:42:55] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:43:50] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host moss-be2003.mgmt.codfw.wmnet with reboot policy FORCED [16:44:10] (ThanosQueryRangeLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh [16:46:10] (ThanosQueryHttpRequestQueryRangeErrorRateHigh) resolved: Thanos Query is failing to handle requests. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh [16:46:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [16:46:47] (03CR) 10Btullis: "At some point, it might be useful for us to change the layout so that we can build multiple versions of spark concurrently, as we do with " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952476 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [16:48:07] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:49:10] (ThanosQueryRangeLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh [16:51:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [16:55:00] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host moss-be2003.mgmt.codfw.wmnet with reboot policy FORCED [16:56:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [16:57:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host moss-be2003.mgmt.codfw.wmnet with reboot policy FORCED [16:59:09] 10SRE, 10Observability-Metrics, 10observability: Add Icinga check for CPU frequency on Dell R320 - https://phabricator.wikimedia.org/T163220 (10CDanis) So, a few notes: * We no longer have any Dell R320s in production -- netbox reports 0 instances. https://netbox.wikimedia.org/dcim/device-types/37/ * I man... 
[17:02:39] (03PS1) 10Cathal Mooney: Allow MGMT_NETWORKS connect to apt server private server on 8080 [puppet] - 10https://gerrit.wikimedia.org/r/952478 (https://phabricator.wikimedia.org/T336485) [17:03:45] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host moss-be2003.mgmt.codfw.wmnet with reboot policy FORCED [17:03:54] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1009.eqiad.wmnet with OS bullseye [17:04:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host moss-be2003.mgmt.codfw.wmnet with reboot policy FORCED [17:05:31] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:11:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host moss-be2003.mgmt.codfw.wmnet with reboot policy FORCED [17:11:29] (03CR) 10Jbond: [C: 03+1] "fyi feel free to +2 and self merge changes to your user files" [puppet] - 10https://gerrit.wikimedia.org/r/952475 (owner: 10Btullis) [17:12:56] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951591 (https://phabricator.wikimedia.org/T344704) (owner: 10Ssingh) [17:13:17] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/948540 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [17:13:18] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['moss-be2003'] [17:13:49] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['moss-be2003'] [17:13:55] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/948538 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [17:13:57] (03CR) 10Majavah: [C: 03+1] wmf-config: remove public subnets from reverse-proxy.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951591 (https://phabricator.wikimedia.org/T344704) (owner: 10Ssingh) [17:14:27] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/948535 (https://phabricator.wikimedia.org/T326322) (owner: 10Ayounsi) [17:14:35] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['moss-be2003'] [17:15:02] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['moss-be2003'] [17:15:04] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/948113 (https://phabricator.wikimedia.org/T335027) (owner: 10Ayounsi) [17:15:14] (03CR) 10Cathal Mooney: [C: 03+1] Enable sftp-server [homer/public] - 10https://gerrit.wikimedia.org/r/947715 (https://phabricator.wikimedia.org/T316544) (owner: 10Ayounsi) [17:15:38] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['moss-be2003'] [17:16:10] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1009.eqiad.wmnet with reason: host reimage [17:18:27] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Idle - HE, AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:18:46] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1009.eqiad.wmnet with reason: host reimage [17:26:08] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [17:28:23] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:28:33] PROBLEM - mailman list info on 
lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:29:29] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a1-codfw - cmooney@cumin1001" [17:29:41] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:29:57] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 6.394 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:30:15] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove management record for ssw1-a1-codfw - cmooney@cumin1001" [17:30:15] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:30:15] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.network.provision (exit_code=99) for device ssw1-a1-codfw.mgmt.codfw.wmnet [17:36:51] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1009.eqiad.wmnet with OS bullseye [17:39:09] !log bking@deploy1002 Started deploy [wdqs/wdqs@16e3dcf]: push deploy after bullseye reimage T343124 [17:39:14] T343124: Migrate WDQS and WCQS servers to Debian Bullseye - https://phabricator.wikimedia.org/T343124 [17:39:29] !log bking@deploy1002 Finished deploy [wdqs/wdqs@16e3dcf]: push deploy after bullseye reimage T343124 (duration: 00m 19s) [17:41:31] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [17:41:32] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [17:43:32] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [17:43:36] !log jforrester@deploy1002 helmfile [staging] DONE 
helmfile.d/services/wikifunctions: apply [17:44:02] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: sync [17:44:28] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: sync [17:44:46] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: sync [17:45:25] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: sync [17:45:31] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['moss-be2003'] [17:45:31] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: sync [17:45:49] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [17:46:07] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: sync [17:46:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [17:46:37] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['moss-be2003'] [17:47:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['moss-be2003'] [17:51:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [17:54:47] (03CR) 10Xcollazo: [C: 03+1] "LGTM!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/952476 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [17:59:00] (03CR) 10Jforrester: "This potentially has broken the Wikifunctions evaluator service; it suddenly went down, and this is the only patch to that chart for weeks" [deployment-charts] - 10https://gerrit.wikimedia.org/r/950188 (https://phabricator.wikimedia.org/T344177) (owner: 10JMeybohm) [18:01:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host moss-be2003.mgmt.codfw.wmnet with reboot policy FORCED [18:01:57] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host moss-be2003.mgmt.codfw.wmnet with reboot policy FORCED [18:07:31] (03PS1) 10Jforrester: Revert "wikifunctions: Fix networkpolicies" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952434 (https://phabricator.wikimedia.org/T344998) [18:07:58] (03PS1) 10Jforrester: wikifunctions: Bump evaluator to 2023-08-08-220047 [deployment-charts] - 10https://gerrit.wikimedia.org/r/952484 (https://phabricator.wikimedia.org/T344998) [18:08:37] (03CR) 10Jforrester: Revert "wikifunctions: Fix networkpolicies" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/952434 (https://phabricator.wikimedia.org/T344998) (owner: 10Jforrester) [18:09:43] (03CR) 10Jforrester: [C: 03+2] "Let's try this, for now." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/952484 (https://phabricator.wikimedia.org/T344998) (owner: 10Jforrester) [18:10:35] (03Merged) 10jenkins-bot: wikifunctions: Bump evaluator to 2023-08-08-220047 [deployment-charts] - 10https://gerrit.wikimedia.org/r/952484 (https://phabricator.wikimedia.org/T344998) (owner: 10Jforrester) [18:12:05] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [18:12:07] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [18:12:47] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [18:13:16] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [18:13:24] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [18:14:14] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [18:14:17] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [18:15:10] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [18:16:27] 10SRE, 10Infrastructure-Foundations, 10Traffic: Reimaging lvs2012 fails as the host is unreachable from cumin2002 - https://phabricator.wikimedia.org/T336428 (10BCornwall) 05Open→03Resolved a:03BCornwall cumin2002 is able to ping both v4 and v6, so I'm going to mark as closed. Thanks, everyone! 
[18:18:54] (JobUnavailable) firing: (3) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:19:36] (03PS2) 10Jforrester: Revert "wikifunctions: Fix networkpolicies" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952434 (https://phabricator.wikimedia.org/T344998) [18:19:44] (03CR) 10Jforrester: Revert "wikifunctions: Fix networkpolicies" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/952434 (https://phabricator.wikimedia.org/T344998) (owner: 10Jforrester) [18:20:34] (03Abandoned) 10Jforrester: Add AppArmor configuration for the deployed function-evaluator service. [deployment-charts] - 10https://gerrit.wikimedia.org/r/936316 (https://phabricator.wikimedia.org/T326785) (owner: 10Cory Massaro) [18:20:40] (03CR) 10Jforrester: [C: 03+2] Revert "wikifunctions: Fix networkpolicies" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952434 (https://phabricator.wikimedia.org/T344998) (owner: 10Jforrester) [18:21:31] (03Merged) 10jenkins-bot: Revert "wikifunctions: Fix networkpolicies" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952434 (https://phabricator.wikimedia.org/T344998) (owner: 10Jforrester) [18:22:28] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [18:22:30] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [18:23:40] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [18:23:43] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [18:25:39] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [18:26:07] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [18:34:44] 
(03PS1) 10Jforrester: wikifunctions: Add production alerting alongside beta [puppet] - 10https://gerrit.wikimedia.org/r/952486 [18:35:18] (03PS1) 10Jforrester: Revert "admin_ng: Disable GlobalNetworkPolicy allow rules for wikifunctions" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952435 (https://phabricator.wikimedia.org/T344998) [18:36:10] (03PS2) 10Jforrester: Revert "admin_ng: Disable GlobalNetworkPolicy allow rules for wikifunctions" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952435 (https://phabricator.wikimedia.org/T344998) [18:39:44] (03CR) 10Jforrester: [C: 03+2] Revert "admin_ng: Disable GlobalNetworkPolicy allow rules for wikifunctions" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952435 (https://phabricator.wikimedia.org/T344998) (owner: 10Jforrester) [18:42:15] (03Merged) 10jenkins-bot: Revert "admin_ng: Disable GlobalNetworkPolicy allow rules for wikifunctions" [deployment-charts] - 10https://gerrit.wikimedia.org/r/952435 (https://phabricator.wikimedia.org/T344998) (owner: 10Jforrester) [18:44:16] (03PS1) 10Jforrester: wikifunctions: Drop beta monitoring [puppet] - 10https://gerrit.wikimedia.org/r/952488 [18:45:08] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [18:45:10] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [18:45:31] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host moss-be2003.mgmt.codfw.wmnet with reboot policy FORCED [18:45:49] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host moss-be2003.mgmt.codfw.wmnet with reboot policy FORCED [18:46:10] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [18:47:13] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [18:47:15] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [18:48:12] !log 
jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [18:48:55] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:50:17] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.013 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:52:45] (PS1) Jforrester: Re-apply "wikifunctions: Fix networkpolicies" [deployment-charts] - https://gerrit.wikimedia.org/r/952436 (https://phabricator.wikimedia.org/T344177) [18:53:18] (PS1) Jforrester: Re-apply "admin_ng: Disable GlobalNetworkPolicy allow rules for wikifunctions" [deployment-charts] - https://gerrit.wikimedia.org/r/952437 (https://phabricator.wikimedia.org/T344177) [18:55:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be2003.codfw.wmnet with OS bullseye [18:55:52] SRE, SRE-swift-storage, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye [18:59:05] (PS2) Jforrester: wikifunctions: Drop beta monitoring [puppet] - https://gerrit.wikimedia.org/r/952488 (https://phabricator.wikimedia.org/T321099) [18:59:46] SRE, Abstract Wikipedia team, Wikifunctions, serviceops, and 2 others: Wikifunctions functions that call the evaluator are all getting no response, UX instead showing 'http' - https://phabricator.wikimedia.org/T344998 (Jdforrester-WMF) Unfortunately at this point I'm out of ideas as to what's cau... 
[19:01:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [19:08:29] ^ we have had persistent redis alerts for a few days [19:08:37] anyone looked at them? [19:16:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [19:19:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:26:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [19:31:21] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:36:12] sukhe: yeah ako.siaris had looked into it I think yesterday, there is some info in the backscroll of #-serviceops [19:36:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [19:37:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:39:05] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [19:39:59] (SystemdUnitFailed) firing: (8) 
ipmiseld.service Failed on wdqs1005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:40:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:41:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [19:42:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:45:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:47:17] thanks herron. 
[19:48:12] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-be2003.codfw.wmnet with OS bullseye [19:48:18] SRE, SRE-swift-storage, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host moss-be2003.codfw.wmnet with OS bullseye executed with... [19:50:34] (PS2) Ebernhardson: Draft: cirrus streaming updater service [deployment-charts] - https://gerrit.wikimedia.org/r/951960 [19:51:27] (CR) CI reject: [V: -1] Draft: cirrus streaming updater service [deployment-charts] - https://gerrit.wikimedia.org/r/951960 (owner: Ebernhardson) [19:56:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [20:16:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [20:26:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [20:27:56] (PS3) Ebernhardson: Draft: cirrus streaming updater service [deployment-charts] - https://gerrit.wikimedia.org/r/951960 [20:28:51] PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [20:31:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull 
[20:31:46] (PS1) Papaul: Add moss-be2003 to site.pp [puppet] - https://gerrit.wikimedia.org/r/952513 (https://phabricator.wikimedia.org/T342674) [20:32:29] (CR) CI reject: [V: -1] Draft: cirrus streaming updater service [deployment-charts] - https://gerrit.wikimedia.org/r/951960 (owner: Ebernhardson) [20:32:57] (CR) Papaul: [C: +2] Add moss-be2003 to site.pp [puppet] - https://gerrit.wikimedia.org/r/952513 (https://phabricator.wikimedia.org/T342674) (owner: Papaul) [20:35:17] SRE, SRE-swift-storage, ops-codfw, DC-Ops, and 2 others: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (Papaul) [20:36:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [20:40:37] (CR) Bking: "recheck" [cookbooks] - https://gerrit.wikimedia.org/r/937535 (https://phabricator.wikimedia.org/T340793) (owner: Bking) [20:41:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [21:01:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [21:02:31] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on wdqs1005.eqiad.wmnet with reason: to be decommissioned soon [21:02:44] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on wdqs1005.eqiad.wmnet with reason: to be decommissioned soon [21:03:47] !log bking@cumin1001 shutting off wdqs1005 in preparation for decommission T344198 [21:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:52] T344198: Decommission wdqs10[03-05] - 
https://phabricator.wikimedia.org/T344198 [21:15:33] RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [21:16:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [21:20:49] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [21:25:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [21:25:49] (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [21:26:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [21:30:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [21:41:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:01:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:06:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:16:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:18:54] (JobUnavailable) firing: (3) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:26:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:31:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:36:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:46:28] (RedisMemoryFull) firing: (4) Redis memory full on 
rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:51:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:56:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [23:06:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [23:11:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [23:21:28] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [23:56:28] (RedisMemoryFull) firing: (3) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull