[00:02:02] PROBLEM - Host ps1-e8-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[00:02:02] PROBLEM - Host ps1-f8-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[00:03:12] SRE, ops-eqiad, DC-Ops, Patch-For-Review: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (Papaul)
[00:08:54] (PS2) Jbond: team-sre/resource: Add disk space [alerts] - https://gerrit.wikimedia.org/r/902457 (https://phabricator.wikimedia.org/T332764)
[00:08:57] SRE, ops-eqiad, DC-Ops, Patch-For-Review: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (Papaul) I setup the last 8 PDU's row E [4-8] and row F [4-8] but was not able to setup ps1-e8 ad ps1-f8 because of connectivity issue. I added those to librenms and icinga. so...
[00:10:06] (CR) CI reject: [V: -1] team-sre/resource: Add disk space [alerts] - https://gerrit.wikimedia.org/r/902457 (https://phabricator.wikimedia.org/T332764) (owner: Jbond)
[00:24:57] (PS1) Dzahn: systemd: allow service names to end in .timer [puppet] - https://gerrit.wikimedia.org/r/902520
[00:26:24] (CR) Dzahn: Disable the package installed systemd timer for logrotate (1 comment) [puppet] - https://gerrit.wikimedia.org/r/902396 (owner: EoghanGaffney)
[00:32:45] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[00:32:47] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[00:38:44] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[00:38:46] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[00:45:20] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:58:40] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:08:12] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:40:36] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:16:36] * Krinkle testing on mwdebug1001
[02:23:52] (PS1) Krinkle: Profiler: Implement "Excimer UI" option for WikimediaDebug [mediawiki-config] - https://gerrit.wikimedia.org/r/902529 (https://phabricator.wikimedia.org/T291015)
[02:26:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:02:34] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:03:14] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:19:38] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:20:22] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:23:22] (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[04:28:22] (RedisMemoryFull) resolved: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[05:13:49] (PS11) Hashar: Display Zuul status of jobs for a change in Gerrit [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/859127 (https://phabricator.wikimedia.org/T214068)
[05:15:40] (CR) Hashar: [C: +2] wm-checks-api: support the Early Warning bot [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/894099 (https://phabricator.wikimedia.org/T330850) (owner: Hashar)
[05:16:11] (Merged) jenkins-bot: wm-checks-api: support the Early Warning bot [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/894099 (https://phabricator.wikimedia.org/T330850) (owner: Hashar)
[05:16:43] (CR) Hashar: [C: +2] Display Zuul status of jobs for a change in Gerrit [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/859127 (https://phabricator.wikimedia.org/T214068) (owner: Hashar)
[05:17:14] (Merged) jenkins-bot: Display Zuul status of jobs for a change in Gerrit [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/859127 (https://phabricator.wikimedia.org/T214068) (owner: Hashar)
[05:17:43] !log Restarting Gerrit for deploying plugins updates
[05:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:20:53] !log hashar@deploy2002 Started deploy [gerrit/gerrit@c1cbda4]: Update js plugins for EarlyWarning bot (T330850) and displaying Zuul status on changes (T241068)
[05:21:00] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@c1cbda4]: Update js plugins for EarlyWarning bot (T330850) and displaying Zuul status on changes (T241068) (duration: 00m 07s)
[05:21:00] T241068: Restrouter health checks fail when local wikifeeds instance is not pool in discovery records - https://phabricator.wikimedia.org/T241068
[05:21:01] T330850: [wm-checks-api] support EarlyWarningBot - https://phabricator.wikimedia.org/T330850
[05:22:20] !log Restarting gerrit replica on gerrit2002.wikimedia.org
[05:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:58] !log hashar@deploy2002 Started deploy [gerrit/gerrit@c1cbda4]: Update js plugins for EarlyWarning bot (T330850) and displaying Zuul status on changes (T241068)
[05:26:04] T241068: Restrouter health checks fail when local wikifeeds instance is not pool in discovery records - https://phabricator.wikimedia.org/T241068
[05:26:04] T330850: [wm-checks-api] support EarlyWarningBot - https://phabricator.wikimedia.org/T330850
[05:26:08] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@c1cbda4]: Update js plugins for EarlyWarning bot (T330850) and displaying Zuul status on changes (T241068) (duration: 00m 10s)
[05:26:22] !log Stopping Gerrit
[05:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:28:50] !log Restarted Gerrit
[05:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:37:25] !log gerrit: refreshed ssh host key for `github.com`
[05:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230324T0600)
[06:07:54] PROBLEM - Disk space on ms-be2067 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdy1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2067&var-datasource=codfw+prometheus/ops
[06:11:04] PROBLEM - Check systemd state on ms-be2067 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:23:22] (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[06:28:22] (RedisMemoryFull) resolved: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[06:29:33] (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[06:33:22] (RedisMemoryFull) resolved: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230324T0700)
[07:03:22] (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[07:08:22] (RedisMemoryFull) resolved: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[07:13:22] (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[07:14:33] (RedisMemoryFull) firing: (2) Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[07:18:22] (RedisMemoryFull) resolved: (2) Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[07:22:32] (CR) Muehlenhoff: [C: +2] Add sre-admins as an NDA-relevant group [puppet] - https://gerrit.wikimedia.org/r/902423 (owner: Muehlenhoff)
[07:27:53] (PS2) Giuseppe Lavagetto: profile::tlsproxy::envoy: fix margins in errorpage [puppet] - https://gerrit.wikimedia.org/r/902448
[07:27:55] (PS2) Giuseppe Lavagetto: mediawiki::errorpage: rationalize usage [puppet] - https://gerrit.wikimedia.org/r/902446
[07:27:57] (PS2) Giuseppe Lavagetto: mediawiki::tlsproxy::yaml_defs: add error page to envoy [puppet] - https://gerrit.wikimedia.org/r/902447 (https://phabricator.wikimedia.org/T287983)
[07:30:28] (CR) CI reject: [V: -1] mediawiki::errorpage: rationalize usage [puppet] - https://gerrit.wikimedia.org/r/902446 (owner: Giuseppe Lavagetto)
[07:30:30] (CR) CI reject: [V: -1] mediawiki::tlsproxy::yaml_defs: add error page to envoy [puppet] - https://gerrit.wikimedia.org/r/902447 (https://phabricator.wikimedia.org/T287983) (owner: Giuseppe Lavagetto)
[07:34:10] (CR) Giuseppe Lavagetto: [C: +2] profile::tlsproxy::envoy: fix margins in errorpage [puppet] - https://gerrit.wikimedia.org/r/902448 (owner: Giuseppe Lavagetto)
[07:36:10] (PS22) Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151)
[07:39:13] (CR) Nicolas Fraison: osd: Add osd on new ceph cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: Nicolas Fraison)
[07:41:50] (PS3) Giuseppe Lavagetto: mediawiki::errorpage: rationalize usage [puppet] - https://gerrit.wikimedia.org/r/902446
[07:41:52] (PS3) Giuseppe Lavagetto: mediawiki::tlsproxy::yaml_defs: add error page to envoy [puppet] - https://gerrit.wikimedia.org/r/902447 (https://phabricator.wikimedia.org/T287983)
[07:43:43] (CR) CI reject: [V: -1] mediawiki::tlsproxy::yaml_defs: add error page to envoy [puppet] - https://gerrit.wikimedia.org/r/902447 (https://phabricator.wikimedia.org/T287983) (owner: Giuseppe Lavagetto)
[07:43:45] (CR) CI reject: [V: -1] mediawiki::errorpage: rationalize usage [puppet] - https://gerrit.wikimedia.org/r/902446 (owner: Giuseppe Lavagetto)
[07:46:28] (PS23) Nicolas Fraison: osd: Add osd on new ceph cluster [puppet] - https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151)
[07:46:32] (PS1) Marostegui: mariadb: Promote db1101 to m1 master [puppet] - https://gerrit.wikimedia.org/r/902572 (https://phabricator.wikimedia.org/T331510)
[07:46:39] (CR) Marostegui: [C: -2] mariadb: Promote db1101 to m1 master [puppet] - https://gerrit.wikimedia.org/r/902572 (https://phabricator.wikimedia.org/T331510) (owner: Marostegui)
[07:46:46] (CR) Marostegui: [C: -2] "Wait for the failover day" [puppet] - https://gerrit.wikimedia.org/r/902572 (https://phabricator.wikimedia.org/T331510) (owner: Marostegui)
[07:50:49] (PS4) Giuseppe Lavagetto: mediawiki::errorpage: rationalize usage [puppet] - https://gerrit.wikimedia.org/r/902446
[07:50:51] (PS4) Giuseppe Lavagetto: mediawiki::tlsproxy::yaml_defs: add error page to envoy [puppet] - https://gerrit.wikimedia.org/r/902447 (https://phabricator.wikimedia.org/T287983)
[07:52:16] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40307/console" [puppet] - https://gerrit.wikimedia.org/r/902447 (https://phabricator.wikimedia.org/T287983) (owner: Giuseppe Lavagetto)
[07:52:18] (PS1) Marostegui: dbproxy1012,dbproxy1014: Test db1101 on proxies [puppet] - https://gerrit.wikimedia.org/r/902574 (https://phabricator.wikimedia.org/T331510)
[07:52:45] (CR) Marostegui: [C: +2] dbproxy1012,dbproxy1014: Test db1101 on proxies [puppet] - https://gerrit.wikimedia.org/r/902574 (https://phabricator.wikimedia.org/T331510) (owner: Marostegui)
[07:54:49] (PS1) Marostegui: Revert "dbproxy1012,dbproxy1014: Test db1101 on proxies" [puppet] - https://gerrit.wikimedia.org/r/902592
[07:55:10] (PS1) Nicolas Fraison: ceph::firewall: osd_cluster_addrs is not define [puppet] - https://gerrit.wikimedia.org/r/902575
[07:55:20] (CR) Marostegui: [C: +2] Revert "dbproxy1012,dbproxy1014: Test db1101 on proxies" [puppet] - https://gerrit.wikimedia.org/r/902592 (owner: Marostegui)
[07:56:43] (PS2) Nicolas Fraison: ceph::firewall: osd_cluster_addrs is not define [puppet] - https://gerrit.wikimedia.org/r/902575
[07:57:09] (PS2) Marostegui: mariadb: Promote db1101 to m1 master [puppet] - https://gerrit.wikimedia.org/r/902572 (https://phabricator.wikimedia.org/T331510)
[07:57:52] (CR) Slyngshede: Squid logformat to ECS (1 comment) [puppet] - https://gerrit.wikimedia.org/r/901544 (owner: Slyngshede)
[08:09:52] (PS1) Muehlenhoff: buster updates [puppet] - https://gerrit.wikimedia.org/r/902576
[08:09:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:14:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:17:35] (CR) Slyngshede: [C: +1] "From this weeks understanding of Logstash it looks good" [puppet] - https://gerrit.wikimedia.org/r/901631 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[08:19:46] (CR) Muehlenhoff: [C: +2] buster updates [puppet] - https://gerrit.wikimedia.org/r/902576 (owner: Muehlenhoff)
[08:26:30] SRE, Wikimedia-GitHub: Github RSA ssh host key updated 2023-03-23 - https://phabricator.wikimedia.org/T332972 (hashar)
[08:32:51] (PS1) Hashar: gerrit: update Github RSA ssh host key [puppet] - https://gerrit.wikimedia.org/r/902662 (https://phabricator.wikimedia.org/T332972)
[08:33:59] SRE, Gerrit, Wikimedia-GitHub: Github RSA ssh host key updated 2023-03-23 - https://phabricator.wikimedia.org/T332972 (hashar)
[08:35:34] (PS1) Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664
[08:37:28] (CR) CI reject: [V: -1] logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664 (owner: Slyngshede)
[08:38:18] (CR) Giuseppe Lavagetto: [C: +2] gerrit: update Github RSA ssh host key [puppet] - https://gerrit.wikimedia.org/r/902662 (https://phabricator.wikimedia.org/T332972) (owner: Hashar)
[08:39:19] (PS2) Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664
[08:41:11] (CR) CI reject: [V: -1] logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664 (owner: Slyngshede)
[08:43:19] (PS3) Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664
[08:43:33] (CR) Thiemo Kreuz (WMDE): [C: -1] mediawiki: Reduce the frequency of flaggedrevs updates (1 comment) [puppet] - https://gerrit.wikimedia.org/r/859589 (https://phabricator.wikimedia.org/T323495) (owner: Ladsgroup)
[08:45:11] (CR) CI reject: [V: -1] logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664 (owner: Slyngshede)
[08:46:53] (PS4) Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664
[08:48:50] (CR) CI reject: [V: -1] logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664 (owner: Slyngshede)
[08:53:22] (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[08:53:51] (PS1) Muehlenhoff: Set profile::base::remove_python2_on_bullseye for the LVSes [puppet] - https://gerrit.wikimedia.org/r/902666 (https://phabricator.wikimedia.org/T321309)
[08:54:33] (RedisMemoryFull) resolved: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
[08:56:49] (CR) Filippo Giunchedi: k8s: Force docker storage-driver to overlay2 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: JMeybohm)
[08:57:21] !log Fixed up Gerrit > GitHub replication which broke at 5:00 UTC by updating the Github RSA ssh host key T332972
[08:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:28] T332972: Github RSA ssh host key updated 2023-03-23 - https://phabricator.wikimedia.org/T332972
[08:57:46] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40308/console" [puppet] - https://gerrit.wikimedia.org/r/902446 (owner: Giuseppe Lavagetto)
[08:58:00] SRE, Gerrit, Wikimedia-GitHub: Github RSA ssh host key updated 2023-03-23 - https://phabricator.wikimedia.org/T332972 (hashar) Open→Resolved a:hashar At least for #Gerrit that is fixed. There might be other use cases though (TranslateWiki comes to mind)
[08:58:08] (PS5) Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664
[09:00:04] (CR) CI reject: [V: -1] logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664 (owner: Slyngshede)
[09:07:15] (PS6) Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664
[09:09:05] (CR) Filippo Giunchedi: "LGTM, see inline!" [alerts] - https://gerrit.wikimedia.org/r/902364 (https://phabricator.wikimedia.org/T309007) (owner: Cathal Mooney)
[09:09:07] (CR) CI reject: [V: -1] logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664 (owner: Slyngshede)
[09:10:47] (PS7) Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664
[09:12:41] (CR) CI reject: [V: -1] logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664 (owner: Slyngshede)
[09:13:12] (CR) Filippo Giunchedi: "LGTM, modulo data-engineering folks' input on this" [puppet] - https://gerrit.wikimedia.org/r/902454 (https://phabricator.wikimedia.org/T309007) (owner: Cathal Mooney)
[09:14:33] (PS8) Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664
[09:16:18] (CR) CI reject: [V: -1] logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664 (owner: Slyngshede)
[09:16:41] (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/902670 (https://phabricator.wikimedia.org/T332973) (owner: Awight)
[09:17:58] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/902666 (https://phabricator.wikimedia.org/T321309) (owner: Muehlenhoff)
[09:18:17] (PS9) Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664
[09:20:09] (CR) CI reject: [V: -1] logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664 (owner: Slyngshede)
[09:25:04] (CR) WMDE-Fisch: [C: +1] [beta] remove flag for experimental mapdata geoshape expansion [mediawiki-config] - https://gerrit.wikimedia.org/r/902670 (https://phabricator.wikimedia.org/T332973) (owner: Awight)
[09:26:04] (PS10) Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664
[09:27:53] (CR) CI reject: [V: -1] logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664 (owner: Slyngshede)
[09:31:46] (PS11) Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664
[09:33:45] (CR) CI reject: [V: -1] logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664 (owner: Slyngshede)
[09:34:07] SRE, DNS, Stewards-and-global-tools, Traffic, Wikimedia-Hackathon-2023: Wikimedia + DNS issues/ideas mapping *(Rotterdam+Athens+online) - https://phabricator.wikimedia.org/T332971 (Vituzzu)
[09:34:28] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40309/console" [puppet] - https://gerrit.wikimedia.org/r/902447 (https://phabricator.wikimedia.org/T287983) (owner: Giuseppe Lavagetto)
[09:35:40] (PS8) Cathal Mooney: Move event logging checks from Icinga to alertmanager [alerts] - https://gerrit.wikimedia.org/r/902364 (https://phabricator.wikimedia.org/T309007)
[09:36:59] (PS12) Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664
[09:37:06] SRE, DNS, Stewards-and-global-tools, Traffic, Wikimedia-Hackathon-2023: Wikimedia + DNS issues/ideas mapping *(Rotterdam+Athens+online) - https://phabricator.wikimedia.org/T332971 (Vituzzu)
[09:37:29] (CR) CI reject: [V: -1] Move event logging checks from Icinga to alertmanager [alerts] - https://gerrit.wikimedia.org/r/902364 (https://phabricator.wikimedia.org/T309007) (owner: Cathal Mooney)
[09:38:28] (CR) Cathal Mooney: "Thanks Fillipo. Ben could I ask you to look at the below? I believe the results should be the same for the first 3. Only doing the crit" [alerts] - https://gerrit.wikimedia.org/r/902364 (https://phabricator.wikimedia.org/T309007) (owner: Cathal Mooney)
[09:38:51] (CR) CI reject: [V: -1] logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664 (owner: Slyngshede)
[09:39:28] (Abandoned) Giuseppe Lavagetto: modules: re-add base.kubernetes [deployment-charts] - https://gerrit.wikimedia.org/r/901768 (owner: Giuseppe Lavagetto)
[09:39:51] (Abandoned) Giuseppe Lavagetto: tegola-vector-tiles: update to mesh 1.1 [deployment-charts] - https://gerrit.wikimedia.org/r/901767 (https://phabricator.wikimedia.org/T287983) (owner: Giuseppe Lavagetto)
[09:40:35] (PS13) Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664
[09:40:50] SRE-OnFire, SRE-Sprint-Week-Sustainability-March2023, observability, serviceops-radar, Sustainability (Incident Followup): Monitor request throughput on etcd/confd hosts to prevent incidents of software requiring config reload too often - https://phabricator.wikimedia.org/T322400 (Joe) Ope...
[09:40:55] SRE, Dumps-Generation, serviceops, MW-1.39-notes, and 2 others: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (Joe)
[09:41:35] (PS9) Cathal Mooney: Move event logging checks from Icinga to alertmanager [alerts] - https://gerrit.wikimedia.org/r/902364 (https://phabricator.wikimedia.org/T309007)
[09:42:23] (CR) CI reject: [V: -1] logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664 (owner: Slyngshede)
[09:42:57] (PS3) Filippo Giunchedi: team-sre/resource: Add disk space [alerts] - https://gerrit.wikimedia.org/r/902457 (https://phabricator.wikimedia.org/T332764) (owner: Jbond)
[09:44:04] (CR) Filippo Giunchedi: "LGTM overall, I took the liberty of simplifying the expression and fixing the tests." [alerts] - https://gerrit.wikimedia.org/r/902457 (https://phabricator.wikimedia.org/T332764) (owner: Jbond)
[09:44:20] (PS14) Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664
[09:46:13] (CR) CI reject: [V: -1] logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664 (owner: Slyngshede)
[09:47:01] (PS1) Jcrespo: backup_update: Fix logging on successful file history update [software/mediabackups] - https://gerrit.wikimedia.org/r/902672
[09:47:30] (PS15) Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664
[09:49:22] (CR) CI reject: [V: -1] logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664 (owner: Slyngshede)
[09:51:31] (PS16) Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664
[09:53:22] (CR) CI reject: [V: -1] logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664 (owner: Slyngshede)
[09:56:29] SRE-Sprint-Week-Sustainability-March2023, conftool, serviceops-radar, Sustainability (Incident Followup): depool / confctl commands should print warnings or errors if too many nodes from that service are already depooled - https://phabricator.wikimedia.org/T245059 (Joe) On further thoughts: * `c...
[09:57:13] (PS1) Marostegui: db1204: Migrate to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/902676 (https://phabricator.wikimedia.org/T330861)
[09:57:16] jynus: ^
[09:58:04] (CR) Jcrespo: [C: +1] db1204: Migrate to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/902676 (https://phabricator.wikimedia.org/T330861) (owner: Marostegui)
[09:58:16] (CR) Marostegui: [C: +2] db1204: Migrate to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/902676 (https://phabricator.wikimedia.org/T330861) (owner: Marostegui)
[09:58:27] let me check the other patch too, I was busy before
[09:58:36] no rush with that one!
[10:00:25] !log Upgrade db1204 to mariadb 10.6 T330861
[10:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:31] T330861: Migrate backup1-* masters to MariaDB 10.6 - https://phabricator.wikimedia.org/T330861
[10:00:58] (CR) Thiemo Kreuz (WMDE): [C: +1] [beta] remove flag for experimental mapdata geoshape expansion [mediawiki-config] - https://gerrit.wikimedia.org/r/902670 (https://phabricator.wikimedia.org/T332973) (owner: Awight)
[10:01:31] (PS5) Clément Goubert: P:kubernetes::node: Use performance governor [puppet] - https://gerrit.wikimedia.org/r/902119 (https://phabricator.wikimedia.org/T332788)
[10:01:52] SRE-Sprint-Week-Sustainability-March2023, conftool, serviceops-radar, Sustainability (Incident Followup): depool / confctl commands should print warnings or errors if too many nodes from that service are already depooled - https://phabricator.wikimedia.org/T245059 (Joe) a:Joe
[10:02:18] (CR) Btullis: [C: +1] "Super-nice. Many thanks Cathal and Filippo." [alerts] - https://gerrit.wikimedia.org/r/902364 (https://phabricator.wikimedia.org/T309007) (owner: Cathal Mooney)
[10:05:46] SRE, Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (MoritzMuehlenhoff) I had a closer look at this while preparing a separate csqlsh package (which I think we can avoid after all): - The Cassandra deb itself includes csqlsh and the underlying Pyt...
[10:05:58] (KubernetesAPILatency) firing: (7) High Kubernetes API latency (LIST csinodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:06:31] (PS17) Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664
[10:08:23] (CR) CI reject: [V: -1] logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664 (owner: Slyngshede)
[10:10:58] (KubernetesAPILatency) resolved: (20) High Kubernetes API latency (LIST authorizationpolicies) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:11:23] (PS4) Jbond: Disable the package installed systemd timer for logrotate [puppet] - https://gerrit.wikimedia.org/r/902396 (owner: EoghanGaffney)
[10:12:43] (CR) Jbond: [C: -1] "see inline" [puppet] - https://gerrit.wikimedia.org/r/902520 (owner: Dzahn)
[10:18:50] SRE, Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (elukey) The plan looks good, even if I am a little sad that we need to use the python2 hack on cassandra nodes. The `cassandra-tools-wmf` is already available for bullseye/python3: ` cassandra-...
[10:26:59] (03PS1) 10Elukey: ml-services: update docker image for draft topic model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/902678 (https://phabricator.wikimedia.org/T328576) [10:27:25] 10SRE, 10Wikimedia-Mailing-lists: Turn down summary digest frequency - https://phabricator.wikimedia.org/T332927 (10Aklapper) 05Open→03Invalid Hi, administrators' `/settings/digest` offers `Digest Volume Frequency` and `Digest size threshold` settings. These are list settings and these are not user account... [10:27:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:30:28] (03CR) 10Btullis: [C: 03+1] "LGTM, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/902575 (owner: 10Nicolas Fraison) [10:31:04] (03PS1) 10Muehlenhoff: Cassandra: Install python-is-python2 on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/902679 (https://phabricator.wikimedia.org/T310980) [10:32:00] (03CR) 10Elukey: [C: 03+1] Cassandra: Install python-is-python2 on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/902679 (https://phabricator.wikimedia.org/T310980) (owner: 10Muehlenhoff) [10:32:38] (03PS10) 10Cathal Mooney: Move event logging checks from Icinga to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/902364 (https://phabricator.wikimedia.org/T309007) [10:32:41] (03PS1) 10Filippo Giunchedi: Test for referenced rule_files [alerts] - 10https://gerrit.wikimedia.org/r/902681 [10:32:50] (03CR) 10Elukey: [C: 03+2] ml-services: update docker image for draft topic model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/902678 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [10:32:52] (03PS5) 10Jbond: Disable the package installed systemd timer for logrotate [puppet] - 
10https://gerrit.wikimedia.org/r/902396 (owner: 10EoghanGaffney) [10:32:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:34:43] 10SRE, 10Wikimedia-Mailing-lists: Turn down summary digest frequency - https://phabricator.wikimedia.org/T332927 (10Novem_Linguae) @Aklapper . Thanks. Any idea who the administrator of wikimedia-l is? I'm not seeing it in the obvious places (https://lists.wikimedia.org/postorius/lists/wikimedia-l.lists.wikimed... [10:34:49] (03PS6) 10Jbond: Disable the package installed systemd timer for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/902396 (owner: 10EoghanGaffney) [10:34:53] (03PS1) 10Jbond: systemd: minor refactor [puppet] - 10https://gerrit.wikimedia.org/r/902682 [10:35:43] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . 
[10:35:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40312/console" [puppet] - 10https://gerrit.wikimedia.org/r/902682 (owner: 10Jbond) [10:36:19] (03CR) 10CI reject: [V: 04-1] Disable the package installed systemd timer for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/902396 (owner: 10EoghanGaffney) [10:36:25] (03CR) 10CI reject: [V: 04-1] systemd: minor refactor [puppet] - 10https://gerrit.wikimedia.org/r/902682 (owner: 10Jbond) [10:37:09] (03PS2) 10Jbond: systemd: minor refactor [puppet] - 10https://gerrit.wikimedia.org/r/902682 [10:38:47] (03CR) 10CI reject: [V: 04-1] systemd: minor refactor [puppet] - 10https://gerrit.wikimedia.org/r/902682 (owner: 10Jbond) [10:39:12] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40313/console" [puppet] - 10https://gerrit.wikimedia.org/r/902682 (owner: 10Jbond) [10:41:08] (03PS3) 10Jbond: systemd: minor refactor [puppet] - 10https://gerrit.wikimedia.org/r/902682 [10:42:43] (03CR) 10CI reject: [V: 04-1] systemd: minor refactor [puppet] - 10https://gerrit.wikimedia.org/r/902682 (owner: 10Jbond) [10:45:02] (03PS1) 10Elukey: services: update lift wing config for changeprop's staging env [deployment-charts] - 10https://gerrit.wikimedia.org/r/902684 (https://phabricator.wikimedia.org/T328576) [10:47:00] (03CR) 10Muehlenhoff: [C: 03+2] Cassandra: Install python-is-python2 on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/902679 (https://phabricator.wikimedia.org/T310980) (owner: 10Muehlenhoff) [10:47:40] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 38): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40314/console" [puppet] - 10https://gerrit.wikimedia.org/r/902682 (owner: 10Jbond) [10:50:08] (03PS18) 10Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 [10:52:00] (03CR) 10CI 
reject: [V: 04-1] logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 (owner: 10Slyngshede) [10:52:53] (03PS4) 10Jbond: systemd: minor refactor [puppet] - 10https://gerrit.wikimedia.org/r/902682 [10:53:16] 10SRE-Sprint-Week-Sustainability-March2023, 10serviceops, 10Sustainability (Incident Followup): Relax nodeAffinity of sessionstore - https://phabricator.wikimedia.org/T325139 (10eoghan) 05Open→03Resolved {F36924942} We have an alert to catch the condition where a pod gets scheduled on a non-dedicated ho... [10:53:30] (03PS19) 10Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 [10:55:37] (03CR) 10CI reject: [V: 04-1] logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 (owner: 10Slyngshede) [10:55:43] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 21 days, 0:00:00 on krb2002.codfw.wmnet with reason: Non-functional, WIP for Bullseye update [10:55:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 21 days, 0:00:00 on krb2002.codfw.wmnet with reason: Non-functional, WIP for Bullseye update [10:56:04] 10SRE, 10Infrastructure-Foundations: Migrate the KDCs to Bullseye - https://phabricator.wikimedia.org/T331695 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d3c0fbee-5db6-4389-b75e-415ed51c67bc) set by jmm@cumin2002 for 21 days, 0:00:00 on 1 host(s) and their services with reason: Non-func... [10:56:23] 10SRE, 10Wikimedia-Mailing-lists: Turn down summary digest frequency - https://phabricator.wikimedia.org/T332927 (10Aklapper) > Any idea who the administrator of wikimedia-l is? Ah, terminology mismatch on my side. :) "To contact the list owners, use the following email address: wikimedia-l-owner {at} lists.wi... 
[10:56:32] (03CR) 10Elukey: [C: 03+2] services: update lift wing config for changeprop's staging env [deployment-charts] - 10https://gerrit.wikimedia.org/r/902684 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [11:00:57] (03PS20) 10Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 [11:01:27] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: sync [11:01:40] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [11:02:47] (03CR) 10CI reject: [V: 04-1] logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 (owner: 10Slyngshede) [11:03:15] (03CR) 10Jbond: "LGTM couple of nits" [alerts] - 10https://gerrit.wikimedia.org/r/902681 (owner: 10Filippo Giunchedi) [11:05:51] (03CR) 10Jbond: [C: 03+2] systemd: minor refactor [puppet] - 10https://gerrit.wikimedia.org/r/902682 (owner: 10Jbond) [11:05:59] (03PS1) 10Muehlenhoff: cassandra-dev: Enable Python 2 on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/902688 (https://phabricator.wikimedia.org/T310980) [11:07:40] (03CR) 10Jbond: Disable the package installed systemd timer for logrotate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902396 (owner: 10EoghanGaffney) [11:07:48] (03PS7) 10Jbond: Disable the package installed systemd timer for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/902396 (owner: 10EoghanGaffney) [11:08:51] (03PS21) 10Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 [11:10:47] (03CR) 10CI reject: [V: 04-1] logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 (owner: 10Slyngshede) [11:12:18] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40315/console" [puppet] - 10https://gerrit.wikimedia.org/r/902396 (owner: 10EoghanGaffney) 
[11:13:43] (03PS22) 10Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 [11:15:49] (03CR) 10CI reject: [V: 04-1] logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 (owner: 10Slyngshede) [11:16:38] (03PS23) 10Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 [11:17:55] (03CR) 10Jbond: [V: 03+1 C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/902396 (owner: 10EoghanGaffney) [11:18:37] (03CR) 10CI reject: [V: 04-1] logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 (owner: 10Slyngshede) [11:20:16] (03PS24) 10Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 [11:21:13] (03PS1) 10Filippo Giunchedi: sre: check pybal_monitor_down_results_total for PybalBackendDown [alerts] - 10https://gerrit.wikimedia.org/r/902690 (https://phabricator.wikimedia.org/T320627) [11:21:31] (03CR) 10Clément Goubert: [C: 03+1] sre: check pybal_monitor_down_results_total for PybalBackendDown [alerts] - 10https://gerrit.wikimedia.org/r/902690 (https://phabricator.wikimedia.org/T320627) (owner: 10Filippo Giunchedi) [11:22:05] (03CR) 10CI reject: [V: 04-1] logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 (owner: 10Slyngshede) [11:22:42] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: check pybal_monitor_down_results_total for PybalBackendDown [alerts] - 10https://gerrit.wikimedia.org/r/902690 (https://phabricator.wikimedia.org/T320627) (owner: 10Filippo Giunchedi) [11:23:56] (03Merged) 10jenkins-bot: sre: check pybal_monitor_down_results_total for PybalBackendDown [alerts] - 10https://gerrit.wikimedia.org/r/902690 (https://phabricator.wikimedia.org/T320627) (owner: 10Filippo Giunchedi) [11:24:01] (03CR) 10Btullis: osd: Add osd on new ceph cluster (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [11:25:29] (03PS4) 10Jbond: team-sre/resource: Add disk space [alerts] - 10https://gerrit.wikimedia.org/r/902457 (https://phabricator.wikimedia.org/T332764) [11:27:57] (03PS25) 10Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 [11:28:57] (03PS5) 10Jbond: team-sre/resource: Add disk space [alerts] - 10https://gerrit.wikimedia.org/r/902457 (https://phabricator.wikimedia.org/T332764) [11:29:17] (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/901667 (https://phabricator.wikimedia.org/T329857) (owner: 10Ahmon Dancy) [11:29:46] (03CR) 10CI reject: [V: 04-1] logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 (owner: 10Slyngshede) [11:31:21] (03PS6) 10Jbond: team-sre/resource: Add disk space [alerts] - 10https://gerrit.wikimedia.org/r/902457 (https://phabricator.wikimedia.org/T332764) [11:33:04] (03PS26) 10Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 [11:33:15] (03CR) 10Jbond: team-sre/resource: Add disk space (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/902457 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [11:34:17] (03PS2) 10Filippo Giunchedi: Test for referenced rule_files [alerts] - 10https://gerrit.wikimedia.org/r/902681 [11:34:29] (03CR) 10Filippo Giunchedi: "Thank you for the review" [alerts] - 10https://gerrit.wikimedia.org/r/902681 (owner: 10Filippo Giunchedi) [11:34:54] (03CR) 10CI reject: [V: 04-1] logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 (owner: 10Slyngshede) [11:37:22] (03PS4) 10Jbond: alertmanager: also pages to sre for data-engineering, releng and search [puppet] - 10https://gerrit.wikimedia.org/r/902431 (https://phabricator.wikimedia.org/T332709) 
[11:37:34] (03CR) 10Jbond: "updated" [puppet] - 10https://gerrit.wikimedia.org/r/902431 (https://phabricator.wikimedia.org/T332709) (owner: 10Jbond) [11:40:57] (03CR) 10Muehlenhoff: apache2: Use systemd provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902501 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [11:42:18] RECOVERY - Disk space on ms-be2067 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2067&var-datasource=codfw+prometheus/ops [11:43:56] (03CR) 10Muehlenhoff: "Looks good, some nits inline." [puppet] - 10https://gerrit.wikimedia.org/r/902496 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [11:44:39] !log mvernon@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-be2067.codfw.wmnet [11:45:43] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, I don't know if there are any non-backwards-compatible changes between 43 and 50, but we can just go ahead and fix up as neede" [puppet] - 10https://gerrit.wikimedia.org/r/902479 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [11:45:56] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ms-be2067.codfw.wmnet [11:50:26] (03PS5) 10Jbond: alertmanager: also pages to sre for data-engineering, releng and search [puppet] - 10https://gerrit.wikimedia.org/r/902431 (https://phabricator.wikimedia.org/T332709) [11:50:28] (03PS1) 10Jbond: alertmanager: Add Infrastructure Foundations routes [puppet] - 10https://gerrit.wikimedia.org/r/902691 (https://phabricator.wikimedia.org/T332709) [11:50:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:53:39] (03PS27) 10Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 [11:55:21] (03PS7) 10Jbond: team-sre/resource: Add disk space [alerts] - 10https://gerrit.wikimedia.org/r/902457 (https://phabricator.wikimedia.org/T332764) [11:55:32] (03CR) 10CI reject: [V: 04-1] logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 (owner: 10Slyngshede) [11:55:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:03:10] (03PS1) 10Cathal Mooney: Move Icinga eventgate logging external errors checks to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/902694 (https://phabricator.wikimedia.org/T309009) [12:04:22] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10MatthewVernon) [12:04:31] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10MatthewVernon) p:05Triage→03High [12:04:46] (03CR) 10CI reject: [V: 04-1] Move Icinga eventgate logging external errors checks to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/902694 (https://phabricator.wikimedia.org/T309009) (owner: 10Cathal Mooney) [12:04:55] (03CR) 10Hnowlan: [C: 03+2] Remove code for supporting old Debian (<= 9.0) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/902321 (https://phabricator.wikimedia.org/T332548) (owner: 10Kamila Součková) [12:07:32] RECOVERY - Check systemd state on ms-be2067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[12:12:21] (03PS2) 10Cathal Mooney: Move Icinga eventgate logging external errors checks to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/902694 (https://phabricator.wikimedia.org/T309009) [12:12:31] (03Merged) 10jenkins-bot: Remove code for supporting old Debian (<= 9.0) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/902321 (https://phabricator.wikimedia.org/T332548) (owner: 10Kamila Součková) [12:13:37] (03CR) 10CI reject: [V: 04-1] Move Icinga eventgate logging external errors checks to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/902694 (https://phabricator.wikimedia.org/T309009) (owner: 10Cathal Mooney) [12:18:52] (03PS3) 10Cathal Mooney: Move Icinga eventgate logging external errors checks to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/902694 (https://phabricator.wikimedia.org/T309009) [12:20:42] (03CR) 10CI reject: [V: 04-1] Move Icinga eventgate logging external errors checks to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/902694 (https://phabricator.wikimedia.org/T309009) (owner: 10Cathal Mooney) [12:26:07] (03PS4) 10Cathal Mooney: Move Icinga eventgate logging external errors checks to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/902694 (https://phabricator.wikimedia.org/T309009) [12:31:37] (03PS1) 10Jbond: team-sre/systemd: add Check systemd state rule [alerts] - 10https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764) [12:37:38] (03PS1) 10Jbond: P:contact: Add cluseter to role_owner metric [puppet] - 10https://gerrit.wikimedia.org/r/902702 (https://phabricator.wikimedia.org/T332709) [12:38:56] (03CR) 10Jbond: [C: 03+2] P:contact: Add cluseter to role_owner metric [puppet] - 10https://gerrit.wikimedia.org/r/902702 (https://phabricator.wikimedia.org/T332709) (owner: 10Jbond) [12:42:31] (03PS1) 10Cathal Mooney: Remove EventGate Icinga checks that have been moved to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/902703 
(https://phabricator.wikimedia.org/T309009) [12:46:50] (03PS28) 10Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 [12:46:53] (03CR) 10Jbond: [C: 03+2] "after deploying this i realised it is probably not so easy to map clusters to owners. e.g. misc has many different owners. however i thi" [puppet] - 10https://gerrit.wikimedia.org/r/902702 (https://phabricator.wikimedia.org/T332709) (owner: 10Jbond) [12:48:44] (03CR) 10CI reject: [V: 04-1] logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 (owner: 10Slyngshede) [12:49:08] 10SRE, 10LDAP-Access-Requests: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (10larissagaulia) Hey @Dzahn thanks for helping us with getting Barakat the right accesses. Yes, `wmf` LDAP group should do. Regarding the end date: she's scheduled to be with us until July 2023, you c... [12:51:36] (03PS1) 10Muehlenhoff: Add Kamila to exception list, uses Yubikey-backed key [puppet] - 10https://gerrit.wikimedia.org/r/902704 [12:52:17] (03PS1) 10Hashar: wm-zuul-status: filter out non-live item [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/902705 (https://phabricator.wikimedia.org/T214068) [12:52:44] (03PS29) 10Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 [12:54:34] (03CR) 10CI reject: [V: 04-1] logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 (owner: 10Slyngshede) [12:54:50] (03PS2) 10Muehlenhoff: Add Kamila to exception list, uses Yubikey-backed key [puppet] - 10https://gerrit.wikimedia.org/r/902704 [12:55:33] (03PS30) 10Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 [12:57:17] (03CR) 10Muehlenhoff: [C: 03+2] Add Kamila to exception list, uses Yubikey-backed key [puppet] - 
10https://gerrit.wikimedia.org/r/902704 (owner: 10Muehlenhoff) [12:57:24] (03CR) 10CI reject: [V: 04-1] logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 (owner: 10Slyngshede) [12:57:38] (03CR) 10Filippo Giunchedi: "See inline" [alerts] - 10https://gerrit.wikimedia.org/r/902457 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [13:00:06] (03CR) 10Filippo Giunchedi: "ok! probably fine too" [puppet] - 10https://gerrit.wikimedia.org/r/902702 (https://phabricator.wikimedia.org/T332709) (owner: 10Jbond) [13:04:17] (03CR) 10Nicolas Fraison: osd: Add osd on new ceph cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [13:04:30] (03CR) 10Nicolas Fraison: osd: Add osd on new ceph cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [13:07:50] (03PS31) 10Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 [13:08:53] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall though" [alerts] - 10https://gerrit.wikimedia.org/r/902694 (https://phabricator.wikimedia.org/T309009) (owner: 10Cathal Mooney) [13:09:42] (03CR) 10CI reject: [V: 04-1] logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 (owner: 10Slyngshede) [13:10:50] (03CR) 10Filippo Giunchedi: "Fine with me if IF folks are ok with it" [puppet] - 10https://gerrit.wikimedia.org/r/902691 (https://phabricator.wikimedia.org/T332709) (owner: 10Jbond) [13:10:57] (03PS32) 10Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 [13:12:01] (03CR) 10Alexandros Kosiaris: [C: 04-1] systemd: allow service names to end in .timer (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/902520 (owner: 10Dzahn) [13:12:48] (03CR) 10CI 
reject: [V: 04-1] logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 (owner: 10Slyngshede) [13:14:20] (03PS33) 10Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 [13:16:09] (03CR) 10CI reject: [V: 04-1] logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 (owner: 10Slyngshede) [13:17:14] (03PS1) 10Filippo Giunchedi: tests: replace open() with pathlib's read_text [alerts] - 10https://gerrit.wikimedia.org/r/902707 [13:17:42] (03PS34) 10Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 [13:19:35] (03CR) 10CI reject: [V: 04-1] logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 (owner: 10Slyngshede) [13:21:32] (03PS35) 10Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 [13:21:42] (03PS3) 10Clément Goubert: trafficserver: make routing to mw on k8s more manageable [puppet] - 10https://gerrit.wikimedia.org/r/900704 (https://phabricator.wikimedia.org/T331318) (owner: 10Giuseppe Lavagetto) [13:24:04] (03CR) 10CI reject: [V: 04-1] logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 (owner: 10Slyngshede) [13:32:46] (03PS36) 10Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 [13:34:40] (03CR) 10CI reject: [V: 04-1] logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 (owner: 10Slyngshede) [13:39:43] (03PS37) 10Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 [13:41:12] (03PS2) 10Filippo Giunchedi: tests: replace open() with pathlib's read_text [alerts] - 10https://gerrit.wikimedia.org/r/902707 [13:41:38] (03CR) 10CI reject: [V: 04-1] logstash: 
add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 (owner: 10Slyngshede) [13:42:19] (03CR) 10EoghanGaffney: [C: 03+2] Disable the package installed systemd timer for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/902396 (owner: 10EoghanGaffney) [13:43:49] (03PS38) 10Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 [13:43:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:45:44] (03CR) 10CI reject: [V: 04-1] logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 (owner: 10Slyngshede) [13:47:36] (03PS39) 10Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 [13:47:48] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:48:16] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:48:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:49:26] (03CR) 10CI reject: [V: 04-1] logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 (owner: 
10Slyngshede) [13:50:40] (03PS40) 10Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 [13:52:36] (03CR) 10CI reject: [V: 04-1] logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 (owner: 10Slyngshede) [13:52:58] PROBLEM - Check systemd state on phab2002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:53:35] (03PS1) 10Filippo Giunchedi: test: check for external label references in non-global alerts [alerts] - 10https://gerrit.wikimedia.org/r/902711 [13:53:38] PROBLEM - Check systemd state on phab1004 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:57:09] (03PS8) 10Jbond: team-sre/resource: Add disk space [alerts] - 10https://gerrit.wikimedia.org/r/902457 (https://phabricator.wikimedia.org/T332764) [13:57:21] (03CR) 10Jbond: "done" [alerts] - 10https://gerrit.wikimedia.org/r/902457 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [13:59:25] (03CR) 10Jbond: [C: 03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/902681 (owner: 10Filippo Giunchedi) [14:00:59] (03CR) 10Filippo Giunchedi: [C: 03+2] Test for referenced rule_files [alerts] - 10https://gerrit.wikimedia.org/r/902681 (owner: 10Filippo Giunchedi) [14:01:02] (03CR) 10Jbond: [C: 03+2] alertmanager: Add Infrastructure Foundations routes [puppet] - 10https://gerrit.wikimedia.org/r/902691 (https://phabricator.wikimedia.org/T332709) (owner: 10Jbond) [14:04:48] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:05:11] (03PS41) 10Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 [14:05:33] (03CR) 10Jbond: [C: 04-1] "FYI i created the 
following patch for this https://gerrit.wikimedia.org/r/c/operations/puppet/+/902682" [puppet] - 10https://gerrit.wikimedia.org/r/902520 (owner: 10Dzahn) [14:05:52] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:07:07] (03CR) 10CI reject: [V: 04-1] logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 (owner: 10Slyngshede) [14:08:16] (03CR) 10Jbond: [C: 03+1] "lgtm" [alerts] - 10https://gerrit.wikimedia.org/r/902707 (owner: 10Filippo Giunchedi) [14:08:26] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.255 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:09:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:09:28] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49708 bytes in 0.082 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:09:37] (03CR) 10Eevans: [C: 03+1] cassandra-dev: Enable Python 2 on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/902688 (https://phabricator.wikimedia.org/T310980) (owner: 10Muehlenhoff) [14:10:43] (03CR) 10JHathaway: [C: 03+2] bookworm: use default mtail pkg [puppet] - 10https://gerrit.wikimedia.org/r/902479 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [14:12:01] (03CR) 10Muehlenhoff: [C: 03+2] cassandra-dev: Enable Python 2 on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/902688 (https://phabricator.wikimedia.org/T310980) (owner: 10Muehlenhoff) [14:13:21] (03PS42) 10Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - 
10https://gerrit.wikimedia.org/r/902664 [14:13:54] (03CR) 10Jbond: [C: 03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/902711 (owner: 10Filippo Giunchedi) [14:14:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:15:33] (03CR) 10CI reject: [V: 04-1] logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 (owner: 10Slyngshede) [14:16:33] (03PS43) 10Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 [14:17:25] (03PS2) 10JHathaway: bookworm: Update spamassassin daemon name [puppet] - 10https://gerrit.wikimedia.org/r/902496 (https://phabricator.wikimedia.org/T331706) [14:17:56] PROBLEM - Check systemd state on aphlict2001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:03] (03CR) 10Filippo Giunchedi: [C: 03+2] tests: replace open() with pathlib's read_text [alerts] - 10https://gerrit.wikimedia.org/r/902707 (owner: 10Filippo Giunchedi) [14:18:28] (03CR) 10CI reject: [V: 04-1] logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 (owner: 10Slyngshede) [14:18:57] (03CR) 10JHathaway: bookworm: Update spamassassin daemon name (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/902496 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [14:19:29] (03PS44) 10Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 [14:20:04] (03CR) 10Filippo Giunchedi: [C: 03+2] test: check for external label references in non-global alerts (031 comment) [alerts] - 
10https://gerrit.wikimedia.org/r/902711 (owner: 10Filippo Giunchedi) [14:21:26] (03CR) 10CI reject: [V: 04-1] logstash: add squid cache ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902664 (owner: 10Slyngshede) [14:22:48] (03Abandoned) 10DannyS712: Phabricator: add override for the browser time zone conflict message [puppet] - 10https://gerrit.wikimedia.org/r/718418 (https://phabricator.wikimedia.org/T158177) (owner: 10DannyS712) [14:23:56] (03CR) 10Btullis: "PCC still failing, unfortunately:" [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [14:24:57] !log zabe@mwmaint2002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki wikimaniawiki "2024:Expressions of Interest" "Wikimania:Expressions of Interest" "Zabe" --reason "per request [[:phab:T332917|T332917]]" # T332917 [14:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:03] T332917: Move 2024:Expressions of Interest page with translations on wikimaniawiki - https://phabricator.wikimedia.org/T332917 [14:25:36] PROBLEM - Host scs-a8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:27:20] (03PS4) 10Elukey: role::kafka::jumbo::broker: enable PKI migration settings [puppet] - 10https://gerrit.wikimedia.org/r/901549 (https://phabricator.wikimedia.org/T296064) [14:29:44] (03PS3) 10JHathaway: apache2: Use systemd provider [puppet] - 10https://gerrit.wikimedia.org/r/902501 (https://phabricator.wikimedia.org/T331706) [14:30:26] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/902496 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [14:31:02] (03PS5) 10Cathal Mooney: Move Icinga eventgate logging external errors checks to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/902694 (https://phabricator.wikimedia.org/T309009) [14:31:32] RECOVERY - Host scs-a8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.17 
ms [14:32:01] (03CR) 10JHathaway: apache2: Use systemd provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902501 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [14:33:29] (03CR) 10Filippo Giunchedi: "LGTM! I'll let Ben vote though (virtual +1)" [alerts] - 10https://gerrit.wikimedia.org/r/902694 (https://phabricator.wikimedia.org/T309009) (owner: 10Cathal Mooney) [14:33:56] (03CR) 10Cathal Mooney: Move Icinga eventgate logging external errors checks to alertmanager (034 comments) [alerts] - 10https://gerrit.wikimedia.org/r/902694 (https://phabricator.wikimedia.org/T309009) (owner: 10Cathal Mooney) [14:34:10] (03CR) 10Muehlenhoff: apache2: Use systemd provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902501 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [14:34:41] (03CR) 10Filippo Giunchedi: team-sre/resource: Add disk space (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/902457 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [14:35:24] (03PS3) 10JHathaway: bookworm: Update spamassassin daemon name [puppet] - 10https://gerrit.wikimedia.org/r/902496 (https://phabricator.wikimedia.org/T331706) [14:36:34] (03PS6) 10Cathal Mooney: Move Icinga eventgate logging external errors checks to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/902694 (https://phabricator.wikimedia.org/T309009) [14:36:36] (03CR) 10JHathaway: apache2: Use systemd provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902501 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [14:39:21] (03CR) 10Cathal Mooney: [C: 03+2] Move event logging checks from Icinga to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/902364 (https://phabricator.wikimedia.org/T309007) (owner: 10Cathal Mooney) [14:40:48] (03Merged) 10jenkins-bot: Move event logging checks from Icinga to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/902364 (https://phabricator.wikimedia.org/T309007) 
(owner: Cathal Mooney) [14:42:24] (CR) JHathaway: [C: +2] bookworm: Update spamassassin daemon name [puppet] - https://gerrit.wikimedia.org/r/902496 (https://phabricator.wikimedia.org/T331706) (owner: JHathaway) [14:48:59] (PS1) Jbond: alertmanager: the catch all needs a matcher [puppet] - https://gerrit.wikimedia.org/r/902720 [14:49:02] (PS1) Elukey: ml-services: update docker image for draft topic model servers [deployment-charts] - https://gerrit.wikimedia.org/r/902721 (https://phabricator.wikimedia.org/T328576) [14:50:00] (CR) Filippo Giunchedi: [C: +1] alertmanager: the catch all needs a matcher [puppet] - https://gerrit.wikimedia.org/r/902720 (owner: Jbond) [14:50:38] (CR) Jbond: [C: +2] alertmanager: the catch all needs a matcher [puppet] - https://gerrit.wikimedia.org/r/902720 (owner: Jbond) [14:56:04] (CR) Elukey: [C: +2] ml-services: update docker image for draft topic model servers [deployment-charts] - https://gerrit.wikimedia.org/r/902721 (https://phabricator.wikimedia.org/T328576) (owner: Elukey) [14:59:03] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . 
[14:59:43] (PS1) Jbond: alertmanger: fix indentation for I/F team [puppet] - https://gerrit.wikimedia.org/r/902722 [15:00:40] (PS2) Jbond: alertmanger: fix indentation for I/F team [puppet] - https://gerrit.wikimedia.org/r/902722 [15:02:17] (PS1) Elukey: ml-services: update docker image for goodfaith model servers [deployment-charts] - https://gerrit.wikimedia.org/r/902723 (https://phabricator.wikimedia.org/T328576) [15:04:07] (PS3) Filippo Giunchedi: alertmanager: fix indentation for I/F team [puppet] - https://gerrit.wikimedia.org/r/902722 (owner: Jbond) [15:04:56] (CR) Filippo Giunchedi: [C: +1] alertmanager: fix indentation for I/F team [puppet] - https://gerrit.wikimedia.org/r/902722 (owner: Jbond) [15:06:04] (CR) Filippo Giunchedi: [C: +2] alertmanager: fix indentation for I/F team [puppet] - https://gerrit.wikimedia.org/r/902722 (owner: Jbond) [15:07:41] SRE, MediaWiki-extensions-OAuth, Performance-Team, Datacenter-Switchover: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (jsn.sherman) @Tgr I just wanted to followup on this in case either, a) we... [15:07:47] (CR) Elukey: [C: +2] ml-services: update docker image for goodfaith model servers [deployment-charts] - https://gerrit.wikimedia.org/r/902723 (https://phabricator.wikimedia.org/T328576) (owner: Elukey) [15:09:36] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[15:09:45] (03PS9) 10Jbond: team-sre/resource: Add disk space [alerts] - 10https://gerrit.wikimedia.org/r/902457 (https://phabricator.wikimedia.org/T332764) [15:10:06] (03CR) 10Jbond: team-sre/resource: Add disk space (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/902457 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [15:15:14] PROBLEM - Host scs-a8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [15:19:41] (03CR) 10Btullis: [C: 03+1] spark: authorize communication between executors on blockManager port [deployment-charts] - 10https://gerrit.wikimedia.org/r/902409 (https://phabricator.wikimedia.org/T331859) (owner: 10Nicolas Fraison) [15:21:12] RECOVERY - Host scs-a8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [15:23:21] (03PS3) 10Hnowlan: admin: add user kamila [puppet] - 10https://gerrit.wikimedia.org/r/902444 (https://phabricator.wikimedia.org/T332921) (owner: 10Kamila Součková) [15:23:28] (03PS1) 10Elukey: services: enable (again) the first lift wing stream in changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/902725 (https://phabricator.wikimedia.org/T328576) [15:33:30] (03CR) 10Elukey: [C: 03+2] services: enable (again) the first lift wing stream in changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/902725 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [15:35:35] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync [15:35:48] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [15:36:52] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [15:37:10] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [15:38:40] 10SRE, 10Domains, 10Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220 (10BCornwall) @Aklapper, I would agree to decline this but for the line mentioning that enwp.org is in widespread use. 
If it is (it'd be good to see some stats, @violetwtf!) then it might be worth accepting the donation... [15:39:32] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [15:39:42] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [15:43:16] (03CR) 10Nicolas Fraison: osd: Add osd on new ceph cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [15:46:30] PROBLEM - eventgate-main validation error rate too high on alert1001 is CRITICAL: 2.278 gt 0.5 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-main&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos [15:55:54] RECOVERY - eventgate-main validation error rate too high on alert1001 is OK: (C)0.5 gt (W)0 gt 0 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-main&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos [15:56:00] this spike was my bad --^ [15:56:10] new lift wing stream, now it seems working [15:57:44] (03PS4) 10JHathaway: apache2: Use systemd provider [puppet] - 10https://gerrit.wikimedia.org/r/902501 (https://phabricator.wikimedia.org/T331706) [15:58:01] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/902501 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [16:05:22] (03PS5) 10JHathaway: apache2: Use systemd provider [puppet] - 10https://gerrit.wikimedia.org/r/902501 (https://phabricator.wikimedia.org/T331706) [16:05:36] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/902501 (https://phabricator.wikimedia.org/T331706) (owner: 
10JHathaway) [16:09:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:11:56] (03PS6) 10JHathaway: apache2: Use systemd provider [puppet] - 10https://gerrit.wikimedia.org/r/902501 (https://phabricator.wikimedia.org/T331706) [16:12:38] (03CR) 10Btullis: [C: 03+2] ceph::firewall: osd_cluster_addrs is not define [puppet] - 10https://gerrit.wikimedia.org/r/902575 (owner: 10Nicolas Fraison) [16:14:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:16:05] (03CR) 10Btullis: osd: Add osd on new ceph cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [16:17:12] (03PS24) 10Btullis: osd: Add osd on new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [16:18:00] (03Abandoned) 10Dzahn: systemd: allow service names to end in .timer [puppet] - 10https://gerrit.wikimedia.org/r/902520 (owner: 10Dzahn) [16:21:35] (03CR) 10Cathal Mooney: "LGTM! Not familiar with Swift at all, but all expressions make sense based on looking at the Icinga definitions. 
Few small questions, mo" [alerts] - 10https://gerrit.wikimedia.org/r/812883 (https://phabricator.wikimedia.org/T312765) (owner: 10Mark Bergsma) [16:33:38] (03CR) 10JHathaway: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40319/console" [puppet] - 10https://gerrit.wikimedia.org/r/902501 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [16:34:22] (03PS1) 10Btullis: Add dummy keydata for the new ceph OSD services [labs/private] - 10https://gerrit.wikimedia.org/r/902752 (https://phabricator.wikimedia.org/T324660) [16:36:22] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add dummy keydata for the new ceph OSD services [labs/private] - 10https://gerrit.wikimedia.org/r/902752 (https://phabricator.wikimedia.org/T324660) (owner: 10Btullis) [16:37:44] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40320/console" [puppet] - 10https://gerrit.wikimedia.org/r/896116 (https://phabricator.wikimedia.org/T330151) (owner: 10Nicolas Fraison) [16:48:03] (03CR) 10JHathaway: "pcc output, https://puppet-compiler.wmflabs.org/output/902501/40319/" [puppet] - 10https://gerrit.wikimedia.org/r/902501 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [16:50:03] (03PS1) 10Btullis: Add the ceph keys for the osds on the new ceph cluster [puppet] - 10https://gerrit.wikimedia.org/r/902753 (https://phabricator.wikimedia.org/T330151) [16:50:14] (03PS1) 10Jbond: team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - 10https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) [17:16:26] PROBLEM - Host scs-a8-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [17:22:24] RECOVERY - Host scs-a8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [17:25:06] 10SRE, 10ops-eqiad: Q1 '19:(Need by: 2020-06-30) replace scs-a8-eqiad - https://phabricator.wikimedia.org/T228919 (10Jclark-ctr) [17:26:15] 
10SRE, 10ops-eqiad: Q1 '19:(Need by: 2020-06-30) replace scs-a8-eqiad - https://phabricator.wikimedia.org/T228919 (10Jclark-ctr) Scs has been replaced / configured / and verified Rob and papaul could connect to it. My laptop is having issues in data center and is not connecting to it. unable to verify ports... [17:29:24] PROBLEM - PHP7 rendering on parse2017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. not found on http://en.wikipedia.org:80/wiki/Main_Page - 1313 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [17:37:54] 10SRE, 10Wikimedia-Mailing-lists: Turn down summary digest frequency - https://phabricator.wikimedia.org/T332927 (10Risker) I am one of the list owners. Digest frequency has been set at "daily" since the earliest days of this mailing list. The next most frequent option is "weekly" and that's too long. The d... [17:40:59] hey, does anyone know why https://github.com/wikimedia/mediawiki-extensions-PageProperties is suddenly gone? [17:42:07] https://gerrit.wikimedia.org/g/mediawiki/extensions/PageProperties seems to still be there but the GitHub mirror seems to have vanished [17:42:51] cc _joe_ ? (this is probably not your department but not sure who to ask) [17:43:48] <_joe_> uh thcipriani ^^ [17:50:08] 10SRE: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850 (10jbond) [17:50:36] 10SRE, 10observability, 10Patch-For-Review: Add Icinga check for CPU frequency on Dell R320 - https://phabricator.wikimedia.org/T163220 (10jbond) in T332764 we are looking at migrating checks from nagios to icinga and at the same time trying to understand if they are still valid or if they could be improved.... 
[17:50:38] huh [17:50:56] SRE, observability, Patch-For-Review: Add Icinga check for CPU frequency on Dell R320 - https://phabricator.wikimedia.org/T163220 (jbond) Resolved→Open p:Triage→Medium [17:51:11] thcipriani: I haven't seen this before but was just pulling the submodule and saw that the entire repo disappeared [17:54:17] hrm, must've been deleted on the github side some time after 5UTC today [17:55:11] at 5:29 replication started failing [17:55:29] now how to figure out how that happened :\ [17:55:44] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@822dfed]: dump discolytics to 0.10.0, and add transfer_to_es dag [17:55:51] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@822dfed]: dump discolytics to 0.10.0, and add transfer_to_es dag (duration: 00m 06s) [17:58:19] (PS1) Jbond: base::standard_packages: remove isc-dhcp-client [puppet] - https://gerrit.wikimedia.org/r/902763 (https://phabricator.wikimedia.org/T332764) [17:58:52] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40321/console" [puppet] - https://gerrit.wikimedia.org/r/902763 (https://phabricator.wikimedia.org/T332764) (owner: Jbond) [17:59:09] (PS2) Jbond: base::standard_packages: remove isc-dhcp-client [puppet] - https://gerrit.wikimedia.org/r/902763 (https://phabricator.wikimedia.org/T332764) [18:00:03] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@e3c41fb]: bump discolytics to 0.10.0, and add transfer_to_es dag [18:00:24] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@e3c41fb]: bump discolytics to 0.10.0, and add transfer_to_es dag (duration: 00m 20s) [18:00:47] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40322/console" [puppet] - https://gerrit.wikimedia.org/r/902763 (https://phabricator.wikimedia.org/T332764) (owner: Jbond) [18:05:15] 
thcipriani: thanks for fixing so quick! [18:06:33] Reception123: heh, no problem. GitHub gave me a button I clicked :) good catch. Trying now to figure out why it was deleted... [18:07:12] yeah it does seem strange that it was just deleted out of the blue [18:07:13] (restoring was pretty easy: https://docs.github.com/en/repositories/creating-and-managing-repositories/restoring-a-deleted-repository ) [18:07:21] maybe human error :P [18:08:19] it might have something to do with "main" vs "master" and replication problems [18:08:59] but I'm unclear. I see that "main" was deleted on gerrit. And maybe this was meant to fix replication? Anyway, I'll watch replication logs and see what happens. [18:19:45] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:24:45] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:25:24] (03PS1) 10Jbond: team-sre/puppet-agent: Add alertmanager based check for disabled puppet [alerts] - 10https://gerrit.wikimedia.org/r/902764 (https://phabricator.wikimedia.org/T332764) [18:26:37] (03CR) 10CI reject: [V: 04-1] team-sre/puppet-agent: Add alertmanager based check for disabled puppet [alerts] - 10https://gerrit.wikimedia.org/r/902764 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [18:30:39] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@220221d]: set start dates from transfer_to_es dags [18:30:55] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@220221d]: set start dates from transfer_to_es dags (duration: 00m 16s) [18:45:35] 10SRE, 
10observability: Add Icinga check for CPU frequency on Dell R320 - https://phabricator.wikimedia.org/T163220 (10Dzahn) a:05Dzahn→03None [18:47:35] (03PS7) 10JHathaway: apache2: Use systemd provider [puppet] - 10https://gerrit.wikimedia.org/r/902501 (https://phabricator.wikimedia.org/T331706) [18:49:14] (03CR) 10JHathaway: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40323/console" [puppet] - 10https://gerrit.wikimedia.org/r/902501 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [18:50:36] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@fc69bf4]: Make mw rev recommendation create start_date configurable [18:50:49] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@fc69bf4]: Make mw rev recommendation create start_date configurable (duration: 00m 13s) [18:55:44] (03CR) 10JHathaway: [V: 03+1 C: 03+2] apache2: Use systemd provider [puppet] - 10https://gerrit.wikimedia.org/r/902501 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [18:55:50] 10SRE, 10ops-eqiad, 10DC-Ops: document all scs connections - https://phabricator.wikimedia.org/T175876 (10Jclark-ctr) [18:55:52] 10SRE, 10ops-eqiad: Q1 '19:(Need by: 2020-06-30) replace scs-a8-eqiad - https://phabricator.wikimedia.org/T228919 (10Jclark-ctr) [18:56:03] 10SRE, 10ops-eqiad: Q1 '19:(Need by: 2020-06-30) replace scs-a8-eqiad - https://phabricator.wikimedia.org/T228919 (10Jclark-ctr) 05Open→03Resolved a:05Cmjohnson→03Jclark-ctr Fixed laptop issue. 
Verified all ports working [18:56:20] 10SRE, 10ops-eqiad, 10DC-Ops: Eqiad: Backlog HW failure-racking tasks - Decommision and remote work tasks - https://phabricator.wikimedia.org/T332523 (10Jclark-ctr) [19:09:02] PROBLEM - ps1-f8-eqiad-infeed-load-tower-A-phase-X on ps1-f8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:09:02] PROBLEM - ps1-e8-eqiad-infeed-load-tower-B-phase-X on ps1-e8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:09:02] PROBLEM - ps1-e8-eqiad-infeed-load-tower-A-phase-X on ps1-e8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:09:02] PROBLEM - ps1-f8-eqiad-infeed-load-tower-B-phase-Z on ps1-f8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:09:36] PROBLEM - ps1-e8-eqiad-infeed-load-tower-B-phase-Y on ps1-e8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:09:36] PROBLEM - ps1-f8-eqiad-infeed-load-tower-A-phase-Y on ps1-f8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:09:36] PROBLEM - ps1-e8-eqiad-infeed-load-tower-A-phase-Y on ps1-e8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:09:52] PROBLEM - ps1-f8-eqiad-infeed-load-tower-A-phase-Z on ps1-f8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:09:52] PROBLEM - ps1-e8-eqiad-infeed-load-tower-B-phase-Z on ps1-e8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:10:12] PROBLEM - ps1-f8-eqiad-infeed-load-tower-B-phase-X on ps1-f8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:10:12] PROBLEM - ps1-e8-eqiad-infeed-load-tower-A-phase-Z on ps1-e8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:10:12] PROBLEM - ps1-f8-eqiad-infeed-load-tower-B-phase-Y on ps1-f8-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:16:59] (03PS8) 10Aaron Schulz: Add per-action component-level profiling in statsd using excimer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968) [19:30:20] (03PS1) 10JHathaway: mtail: Update defaults for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/902782 (https://phabricator.wikimedia.org/T331706) [19:32:45] (03PS1) 10Dzahn: etherpad: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902783 (https://phabricator.wikimedia.org/T329587) [19:34:17] (03CR) 10JHathaway: [V: 03+1] "PCC SUCCESS (NOOP 20): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40324/console" [puppet] - 10https://gerrit.wikimedia.org/r/902782 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [19:36:37] (03PS1) 10Dzahn: releases: remove Icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902785 (https://phabricator.wikimedia.org/T331901) [19:37:25] (03CR) 
10Dzahn: "see line 103 to 110 to compare against existing prometheus check" [puppet] - 10https://gerrit.wikimedia.org/r/902785 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [19:37:30] (03CR) 10JHathaway: [V: 03+1 C: 03+2] mtail: Update defaults for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/902782 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [19:49:23] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706 (10jhathaway) @Legoktm and @Ladsgroup I have setup a new host, lists1003.wikimedia.org, on bookworm. All the software is installed and most of the bookworm issues have... [19:53:04] (03PS1) 10Dzahn: releases-jenkins: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) [20:04:24] (03PS1) 10Dzahn: monitoring/alerting: globally replace serviceops-collab with sre-collab [puppet] - 10https://gerrit.wikimedia.org/r/902791 (https://phabricator.wikimedia.org/T329587) [20:14:45] (JobUnavailable) firing: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:16:57] (03PS1) 10Dzahn: gerrit: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902799 (https://phabricator.wikimedia.org/T329587) [20:19:45] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:29:31] (03PS1) 10Dzahn: peopleweb: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902801 
(https://phabricator.wikimedia.org/T329587) [20:29:53] (03CR) 10CI reject: [V: 04-1] peopleweb: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [20:30:53] (03PS2) 10Dzahn: peopleweb: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) [20:31:15] (03CR) 10CI reject: [V: 04-1] peopleweb: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [20:31:34] (03PS3) 10Dzahn: peopleweb: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) [20:32:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:36:00] (03PS2) 10Dzahn: releases: remove Icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902785 (https://phabricator.wikimedia.org/T331901) [20:36:43] (03PS4) 10Dzahn: peopleweb: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) [20:37:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:49:14] (03PS1) 10Dzahn: miscweb/static_rt: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902802 (https://phabricator.wikimedia.org/T331901) [20:51:14] (03PS2) 
10Dzahn: miscweb/static_rt: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902802 (https://phabricator.wikimedia.org/T331901) [20:51:30] (03PS2) 10Andrea Denisse: doc: Add role::doc to doc2002 [puppet] - 10https://gerrit.wikimedia.org/r/902505 (https://phabricator.wikimedia.org/T319477) [20:51:59] (03CR) 10CI reject: [V: 04-1] doc: Add role::doc to doc2002 [puppet] - 10https://gerrit.wikimedia.org/r/902505 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse) [20:52:47] (03PS3) 10Andrea Denisse: doc: Add role::doc to doc2002 [puppet] - 10https://gerrit.wikimedia.org/r/902505 (https://phabricator.wikimedia.org/T319477) [20:53:06] (03CR) 10Andrea Denisse: doc: Add role::doc to doc2002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902505 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse) [20:53:08] (03CR) 10CI reject: [V: 04-1] miscweb/static_rt: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902802 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [20:53:25] (03CR) 10Dzahn: [C: 03+2] "This is commented out before and after, so just merging this one first right now, waiting for reviews for the active monitoring." 
[puppet] - 10https://gerrit.wikimedia.org/r/902802 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [20:53:55] (03PS3) 10Dzahn: miscweb/static_rt: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902802 (https://phabricator.wikimedia.org/T331901) [20:54:59] (03PS4) 10Andrea Denisse: doc: Add role::doc to doc2002 [puppet] - 10https://gerrit.wikimedia.org/r/902505 (https://phabricator.wikimedia.org/T319477) [20:56:14] (03PS4) 10Dzahn: miscweb/static_rt: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902802 (https://phabricator.wikimedia.org/T331901) [20:56:38] (03CR) 10Andrea Denisse: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40325/console" [puppet] - 10https://gerrit.wikimedia.org/r/902489 (https://phabricator.wikimedia.org/T332819) (owner: 10Andrea Denisse) [20:58:28] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40326/console" [puppet] - 10https://gerrit.wikimedia.org/r/902505 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse) [21:00:53] (03CR) 10Dzahn: [C: 03+2] Revert "maintenance: temp allow rsyncing home dir to miscweb" [puppet] - 10https://gerrit.wikimedia.org/r/902156 (owner: 10Dzahn) [21:08:48] !log mwmaint1002 ferm rules for rsyncd_access from miscweb removed by puppet after I4fe17f397856361 which reverted a8af0339bde14018e8. manually deleted rsyncd config and stopped rsync service. complete noop on mwmaint2002 which is currently the active mwmaint server. 
T328907 [21:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:54] T328907: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907 [21:10:29] (03CR) 10Dzahn: [C: 03+2] "In hindsight I should have used ensure => absent but I manually deleted rsyncd config and stopped the service and ensured it is not starta" [puppet] - 10https://gerrit.wikimedia.org/r/902156 (owner: 10Dzahn) [21:21:34] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: March 2023 Datacenter Switchover Excluded services - https://phabricator.wikimedia.org/T329193 (10Dzahn) excluded by releng: - https://integration.wikimedia.org - https://releases-jenkins.wikimedia.org [21:22:52] (03PS1) 10JHathaway: Add an in place Debian upgrade script [puppet] - 10https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) [21:23:27] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [21:43:18] (03PS1) 10David Caro: maintain-dbusers: run isort and black and use pep563 types [puppet] - 10https://gerrit.wikimedia.org/r/902815 (https://phabricator.wikimedia.org/T303663) [21:43:20] (03PS1) 10David Caro: maintaint-dbusers: refactor [puppet] - 10https://gerrit.wikimedia.org/r/902816 (https://phabricator.wikimedia.org/T303663) [21:43:22] (03PS1) 10David Caro: maintain-dbusers: only-users match tool users with or without prefix [puppet] - 10https://gerrit.wikimedia.org/r/902817 (https://phabricator.wikimedia.org/T332789) [21:43:24] (03PS1) 10David Caro: maintain-dbusers: allow filtering by account type for maintain [puppet] - 10https://gerrit.wikimedia.org/r/902818 (https://phabricator.wikimedia.org/T332954) [21:43:26] (03PS1) 10David Caro: maintain-dbusers: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) [21:46:12] (03CR) 10CI 
reject: [V: 04-1] maintain-dbusers: run isort and black and use pep563 types [puppet] - 10https://gerrit.wikimedia.org/r/902815 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [21:46:23] (03CR) 10CI reject: [V: 04-1] maintaint-dbusers: refactor [puppet] - 10https://gerrit.wikimedia.org/r/902816 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro) [21:46:36] (03CR) 10CI reject: [V: 04-1] maintain-dbusers: only-users match tool users with or without prefix [puppet] - 10https://gerrit.wikimedia.org/r/902817 (https://phabricator.wikimedia.org/T332789) (owner: 10David Caro) [21:48:20] (03CR) 10CI reject: [V: 04-1] maintain-dbusers: allow filtering by account type for maintain [puppet] - 10https://gerrit.wikimedia.org/r/902818 (https://phabricator.wikimedia.org/T332954) (owner: 10David Caro) [21:48:33] (03CR) 10CI reject: [V: 04-1] maintain-dbusers: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [22:08:15] (03PS2) 10David Caro: maintain-dbusers: run isort and black and use pep563 types [puppet] - 10https://gerrit.wikimedia.org/r/902815 (https://phabricator.wikimedia.org/T303663) [22:08:17] (03PS2) 10David Caro: maintaint-dbusers: refactor [puppet] - 10https://gerrit.wikimedia.org/r/902816 (https://phabricator.wikimedia.org/T303663) [22:08:19] (03PS2) 10David Caro: maintain-dbusers: only-users match tool users with or without prefix [puppet] - 10https://gerrit.wikimedia.org/r/902817 (https://phabricator.wikimedia.org/T332789) [22:08:21] (03PS2) 10David Caro: maintain-dbusers: allow filtering by account type for maintain [puppet] - 10https://gerrit.wikimedia.org/r/902818 (https://phabricator.wikimedia.org/T332954) [22:08:23] (03PS2) 10David Caro: maintain-dbusers: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) [22:13:13] (03CR) 10CI reject: [V: 04-1] maintain-dbusers: allow 
filtering by account type for maintain [puppet] - 10https://gerrit.wikimedia.org/r/902818 (https://phabricator.wikimedia.org/T332954) (owner: 10David Caro) [22:13:31] (03CR) 10CI reject: [V: 04-1] maintain-dbusers: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [22:15:19] 10SRE, 10Domains, 10Traffic-Icebox: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080 (10CRoslof) @BCornwall No updates at the moment. We have a lot of items in our enforcement queue, so it can take a while. If there is a particular need to have these domain names registered so that they... [22:19:02] 10SRE, 10Domains, 10Traffic-Icebox: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080 (10BCornwall) Thanks, @CRoslof! Was this not in the queue before? It's a pretty old ticket! [22:38:36] (03PS3) 10David Caro: maintaint-dbusers: refactor [puppet] - 10https://gerrit.wikimedia.org/r/902816 (https://phabricator.wikimedia.org/T303663) [22:38:38] (03PS3) 10David Caro: maintain-dbusers: only-users match tool users with or without prefix [puppet] - 10https://gerrit.wikimedia.org/r/902817 (https://phabricator.wikimedia.org/T332789) [22:38:40] (03PS3) 10David Caro: maintain-dbusers: allow filtering by account type for maintain [puppet] - 10https://gerrit.wikimedia.org/r/902818 (https://phabricator.wikimedia.org/T332954) [22:38:42] (03PS3) 10David Caro: maintain-dbusers: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) [22:39:20] (03PS4) 10David Caro: maintain-dbusers: refactor [puppet] - 10https://gerrit.wikimedia.org/r/902816 (https://phabricator.wikimedia.org/T303663) [22:39:22] (03PS4) 10David Caro: maintain-dbusers: only-users match tool users with or without prefix [puppet] - 10https://gerrit.wikimedia.org/r/902817 (https://phabricator.wikimedia.org/T332789) [22:39:24] (03PS4) 10David Caro: 
maintain-dbusers: allow filtering by account type for maintain [puppet] - 10https://gerrit.wikimedia.org/r/902818 (https://phabricator.wikimedia.org/T332954) [22:39:26] (03PS4) 10David Caro: maintain-dbusers: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) [22:43:03] 10SRE, 10Traffic, 10Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (10BCornwall) 05Open→03Stalled [22:43:50] (03CR) 10CI reject: [V: 04-1] maintain-dbusers: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro) [22:45:59] 10SRE, 10Traffic, 10HTTPS, 10Tracking-Neverending: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681 (10BCornwall) [22:46:07] 10SRE, 10Traffic, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10BCornwall) 05In progress→03Stalled Is there any outcome to envision other than moving the domain? If not, I can close this and open another ticket to move store.wikimedia.org to a differ... 
[22:47:06] RECOVERY - ps1-e8-eqiad-infeed-load-tower-A-phase-X on ps1-e8-eqiad is OK: SNMP OK - ps1-e8-eqiad-infeed-load-tower-A-phase-X 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:47:06] RECOVERY - ps1-e8-eqiad-infeed-load-tower-B-phase-X on ps1-e8-eqiad is OK: SNMP OK - ps1-e8-eqiad-infeed-load-tower-B-phase-X 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:47:24] RECOVERY - ps1-e8-eqiad-infeed-load-tower-B-phase-Y on ps1-e8-eqiad is OK: SNMP OK - ps1-e8-eqiad-infeed-load-tower-B-phase-Y 41 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:47:26] RECOVERY - ps1-e8-eqiad-infeed-load-tower-A-phase-Y on ps1-e8-eqiad is OK: SNMP OK - ps1-e8-eqiad-infeed-load-tower-A-phase-Y 50 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:47:40] RECOVERY - ps1-e8-eqiad-infeed-load-tower-B-phase-Z on ps1-e8-eqiad is OK: SNMP OK - ps1-e8-eqiad-infeed-load-tower-B-phase-Z 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:47:56] (03PS1) 10Urbanecm: GrowthMentors.json: Add a write-only username field [extensions/GrowthExperiments] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/902734 (https://phabricator.wikimedia.org/T331444) [22:48:04] RECOVERY - ps1-e8-eqiad-infeed-load-tower-A-phase-Z on ps1-e8-eqiad is OK: SNMP OK - ps1-e8-eqiad-infeed-load-tower-A-phase-Z 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:50:02] 10SRE, 10Traffic, 10Patch-For-Review: Deprecate and disable port 80 for one-off sites under canonical domains - https://phabricator.wikimedia.org/T238720 (10BCornwall) p:05Medium→03Low [22:56:00] RECOVERY - ps1-f8-eqiad-infeed-load-tower-B-phase-X on ps1-f8-eqiad is OK: SNMP OK - ps1-f8-eqiad-infeed-load-tower-B-phase-X 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:56:00] RECOVERY - 
ps1-f8-eqiad-infeed-load-tower-B-phase-Y on ps1-f8-eqiad is OK: SNMP OK - ps1-f8-eqiad-infeed-load-tower-B-phase-Y 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:56:04] RECOVERY - ps1-f8-eqiad-infeed-load-tower-A-phase-Y on ps1-f8-eqiad is OK: SNMP OK - ps1-f8-eqiad-infeed-load-tower-A-phase-Y 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:56:04] RECOVERY - ps1-f8-eqiad-infeed-load-tower-A-phase-Z on ps1-f8-eqiad is OK: SNMP OK - ps1-f8-eqiad-infeed-load-tower-A-phase-Z 72 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:56:38] RECOVERY - ps1-f8-eqiad-infeed-load-tower-B-phase-Z on ps1-f8-eqiad is OK: SNMP OK - ps1-f8-eqiad-infeed-load-tower-B-phase-Z 69 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:56:38] RECOVERY - ps1-f8-eqiad-infeed-load-tower-A-phase-X on ps1-f8-eqiad is OK: SNMP OK - ps1-f8-eqiad-infeed-load-tower-A-phase-X 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:59:09] 10SRE, 10ops-eqiad, 10DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (10Papaul) [23:00:00] 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10Papaul) [23:00:12] 10SRE, 10ops-eqiad, 10DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (10Papaul) 05Open→03Resolved This is complete [23:04:17] 10SRE, 10ops-eqiad, 10DC-Ops: Eqiad: Backlog HW failure-racking tasks - Decommision and remote work tasks - https://phabricator.wikimedia.org/T332523 (10Papaul) [23:11:56] 10SRE, 10ops-eqiad, 10DC-Ops: Eqiad: Backlog HW failure-racking tasks - Decommision and remote work tasks - https://phabricator.wikimedia.org/T332523 (10Papaul) [23:15:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network 
racking task - https://phabricator.wikimedia.org/T292095 (10Papaul) [23:35:38] (03PS1) 10Andrea Denisse: doc: Add support for passive_hosts synchronization via rsync [puppet] - 10https://gerrit.wikimedia.org/r/902825 (https://phabricator.wikimedia.org/T319477) [23:37:29] (03CR) 10CI reject: [V: 04-1] doc: Add support for passive_hosts synchronization via rsync [puppet] - 10https://gerrit.wikimedia.org/r/902825 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse) [23:44:16] (03PS2) 10Andrea Denisse: doc: Add support for passive_hosts synchronization via rsync [puppet] - 10https://gerrit.wikimedia.org/r/902825 (https://phabricator.wikimedia.org/T319477) [23:45:11] (03PS1) 10Cwhite: add haproxy ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/902611 (https://phabricator.wikimedia.org/T234565) [23:45:29] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40328/console" [puppet] - 10https://gerrit.wikimedia.org/r/902825 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse) [23:46:13] (03CR) 10CI reject: [V: 04-1] doc: Add support for passive_hosts synchronization via rsync [puppet] - 10https://gerrit.wikimedia.org/r/902825 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse) [23:50:58] !log removing 1 file for legal compliance [23:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:01] (03CR) 10Andrea Denisse: [V: 03+1 C: 03+2] doc: Add role::doc to doc2002 [puppet] - 10https://gerrit.wikimedia.org/r/902505 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse) [23:57:29] !log denisse@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "doc2002 - denisse@cumin1001 - T332819" [23:57:35] T332819: Site: 1 VM request for doc2002 - https://phabricator.wikimedia.org/T332819 [23:58:45] !log denisse@cumin1001 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "doc2002 - denisse@cumin1001 - T332819" [23:59:07] (03PS5) 10David Caro: maintain-dbusers: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955)