[00:02:05] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:02:17] RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:09:21] PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2022-07-12 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:10:49] PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2022-07-12 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:17:15] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:22:49] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2051.codfw.wmnet with OS bullseye [00:28:20] (PS1) Dzahn: add gerrit-replica-new.wikimedia.org, point to 208.80.153.109 [dns] - https://gerrit.wikimedia.org/r/815395 (https://phabricator.wikimedia.org/T313250) [00:32:59] PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2022-07-12 00:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:39:09] (PS4) Jdlrobson: Deploy the new grid layout to group 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/814906 (https://phabricator.wikimedia.org/T312241) [00:39:31] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2051.codfw.wmnet with reason: host reimage [00:43:05] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on elastic2051.codfw.wmnet with reason: host reimage [00:54:28] (PS1) Tim Starling: Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/815285 (https://phabricator.wikimedia.org/T296188) [00:56:31] (PS1) Dzahn: gerrit: add gerrit role and hiera settings for replica to gerrit2002 [puppet] - https://gerrit.wikimedia.org/r/815396 (https://phabricator.wikimedia.org/T313250) [00:56:33] (PS1) Tim Starling: Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/815406 (https://phabricator.wikimedia.org/T296188) [00:57:58] RECOVERY - SSH on mw1321.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:58:33] (PS1) Tim Starling: Temporarily revert language fallback chain changes to yue [core] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/815407 (https://phabricator.wikimedia.org/T296188) [00:58:56] (PS1) Tim Starling: Temporarily revert language fallback chain changes to yue [core] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/815408 (https://phabricator.wikimedia.org/T296188) [01:00:28] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2051.codfw.wmnet with OS bullseye [01:01:15] (PS1) Dzahn: acme_chief: add gerrit2002 to hosts allowed to fetch TLS certs [puppet] - https://gerrit.wikimedia.org/r/815397 (https://phabricator.wikimedia.org/T313250) [01:04:49] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2052.codfw.wmnet with OS bullseye [01:08:18] (PS1) Dzahn: gerrit: add gerrit2002 to firewall rules for cluster support [puppet] - https://gerrit.wikimedia.org/r/815398 (https://phabricator.wikimedia.org/T313250) [01:10:16] RECOVERY - dump of es4 in eqiad on backupmon1001 is OK: Last dump for es4 at eqiad (es1022) 
taken on 2022-07-19 00:00:01 (3282 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:10:55] (PS1) Dzahn: gerrit: add gerrit2002 to puppetized known_hosts file [puppet] - https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250) [01:12:06] PROBLEM - k8s requests count to the API on ml-serve-ctrl2002 is CRITICAL: 102.2 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [01:12:21] (CR) Dzahn: "this goes into /var/lib/gerrit2 on gerrit1001. that's the actual home dir" [puppet] - https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250) (owner: Dzahn) [01:15:46] (PS1) Dzahn: gerrit: add hiera data for a second replica [puppet] - https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) [01:24:42] PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [01:24:42] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2052.codfw.wmnet with reason: host reimage [01:26:05] (CR) Dzahn: [C: -1] "this is not ready yet but I wanted to list it for tomorrow's meeting" [puppet] - https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) (owner: Dzahn) [01:27:05] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2052.codfw.wmnet with reason: host reimage [01:27:17] (CR) Dzahn: [C: -2] "can't be merged before we have the IP in netbox and DNS" [puppet] - https://gerrit.wikimedia.org/r/815396 (https://phabricator.wikimedia.org/T313250) (owner: Dzahn) [01:28:08] (CR) Dzahn: [C: +1] "This can go first to get things out of the way I suppose." 
[puppet] - https://gerrit.wikimedia.org/r/815397 (https://phabricator.wikimedia.org/T313250) (owner: Dzahn) [01:28:30] (CR) Dzahn: "I should also give you shell to gerrit2002..." [puppet] - https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250) (owner: Dzahn) [01:31:35] (PS1) Dzahn: admin/gerrit: add gerrit shell admins on gerrit2002 [puppet] - https://gerrit.wikimedia.org/r/815402 (https://phabricator.wikimedia.org/T313250) [01:38:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:43:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:44:20] RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [01:44:42] (PS2) Dzahn: gerrit: add gerrit2002 to puppetized known_hosts file [puppet] - https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250) [01:49:05] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2052.codfw.wmnet with OS bullseye [01:53:03] (CR) Tim Starling: [C: +2] Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/815285 (https://phabricator.wikimedia.org/T296188) (owner: Tim Starling) [01:53:09] (CR) Tim Starling: [C: +2] Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/815406 
(https://phabricator.wikimedia.org/T296188) (owner: Tim Starling) [01:57:22] PROBLEM - Check systemd state on elastic2052 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:58:24] PROBLEM - k8s requests count to the API on ml-serve-ctrl2002 is CRITICAL: 100.8 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [02:04:28] (PS1) Tim Starling: Switch testwiki to multi-DC active/active mode [puppet] - https://gerrit.wikimedia.org/r/815403 (https://phabricator.wikimedia.org/T279664) [02:10:13] (Merged) jenkins-bot: Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/815285 (https://phabricator.wikimedia.org/T296188) (owner: Tim Starling) [02:11:43] (Merged) jenkins-bot: Temporarily disable yue language fallback tests [extensions/Wikibase] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/815406 (https://phabricator.wikimedia.org/T296188) (owner: Tim Starling) [02:12:15] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [02:13:35] RECOVERY - dump of es5 in eqiad on backupmon1001 is OK: Last dump for es5 at eqiad (es1025) taken on 2022-07-19 00:00:01 (3261 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [02:19:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:23:57] RECOVERY - Check systemd state on elastic2052 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:25:13] RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:25:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:25:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:27:15] PROBLEM - Check systemd state on elastic2060 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:29:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:31:44] (CR) Tim Starling: [C: +2] Temporarily revert language fallback chain changes to yue [core] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/815407 (https://phabricator.wikimedia.org/T296188) (owner: Tim Starling) [02:31:47] (CR) Tim Starling: [C: +2] Temporarily revert language fallback chain changes to yue [core] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/815408 (https://phabricator.wikimedia.org/T296188) (owner: Tim Starling) [02:41:36] I am doing this merge and deployment for Winston_Sung[m], following the discussion in this channel last night my time [02:42:15] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [02:44:47] because I read the task comments with the "confusion, concern and shock" and all that [02:46:30] (Merged) jenkins-bot: Temporarily revert language fallback chain changes to yue [core] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/815407 (https://phabricator.wikimedia.org/T296188) (owner: Tim 
Starling) [02:47:40] not saying I know what the big deal is, I know spoken Cantonese is quite distant from Mandarin but I thought the written languages were pretty close? [02:48:05] (Merged) jenkins-bot: Temporarily revert language fallback chain changes to yue [core] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/815408 (https://phabricator.wikimedia.org/T296188) (owner: Tim Starling) [02:48:58] > I am doing this merge and deployment for Winston_Sung, following the discussion in this channel last night my time [02:48:58] Thanks for the help. [02:49:54] The biggest issue is that they don't want to see the Simplified Han script on the wiki. [02:51:12] !log tstarling@deploy1002 Started scap: revert yue -> zh fallback, needs LC rebuild in both branches T296188 [02:51:14] And because the fallback chain was updated to zh and zh-hans, Tech News delivered the version that contains Simplified Han script. [02:51:15] T296188: Clean up, merge, update zh/zh-* translations and update zh-related language fallback chains in mediawiki/core - https://phabricator.wikimedia.org/T296188 [02:51:58] right [02:53:08] They strongly oppose having Simplified Han script content. [02:53:43] So the fallback chain update for yue needs more discussion. 
[02:54:26] got it [02:54:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:58:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:58:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:59:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [03:10:53] !log tstarling@deploy1002 Finished scap: revert yue -> zh fallback, needs LC rebuild in both branches T296188 (duration: 19m 41s) [03:10:58] T296188: Clean up, merge, update zh/zh-* translations and update zh-related language fallback chains in mediawiki/core - https://phabricator.wikimedia.org/T296188 [03:15:01] OK, that worked, I tested this special page alias before and after: https://zh-yue.wikipedia.org/wiki/Special:%E6%89%80%E6%9C%89%E9%A1%B5%E9%9D%A2 [03:15:31] now 404, previously it was a redirect [03:16:25] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.198 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:16:25] PROBLEM - Host dns1002 is DOWN: PING CRITICAL - Packet loss = 100% [03:16:25] PROBLEM - Host authdns1001 is DOWN: PING CRITICAL - Packet loss = 100% [03:16:25] PROBLEM - Host logstash1011 is DOWN: PING CRITICAL - Packet loss = 100% [03:16:25] PROBLEM - Host bast4003 is DOWN: PING CRITICAL - Packet loss = 100% [03:17:38] PROBLEM - Host kubemaster1002 is DOWN: PING CRITICAL - Packet loss = 100% [03:17:50] PROBLEM - Host wcqs1003 is DOWN: PING CRITICAL - Packet loss = 100% [03:18:02] PROBLEM - Host kubernetes1012 is DOWN: PING CRITICAL - Packet loss = 100% [03:18:04] PROBLEM - Host wdqs1013 is DOWN: PING CRITICAL - Packet loss = 100% [03:18:08] PROBLEM - Host wtp1038 is DOWN: PING CRITICAL - Packet loss = 100% [03:18:08] PROBLEM - Host wtp1037 is DOWN: PING CRITICAL - Packet loss = 100% [03:18:08] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL 
- AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad, AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:18:16] PROBLEM - Host db1146 #page is DOWN: PING CRITICAL - Packet loss = 100% [03:18:18] PROBLEM - Host wtp1039 is DOWN: PING CRITICAL - Packet loss = 100% [03:18:33] PROBLEM - Host pc1013 #page is DOWN: PING CRITICAL - Packet loss = 100% [03:18:37] PROBLEM - Host db1120 #page is DOWN: PING CRITICAL - Packet loss = 100% [03:18:52] PROBLEM - Host db1145 is DOWN: PING CRITICAL - Packet loss = 100% [03:18:52] PROBLEM - Host gitlab-runner1003 is DOWN: PING CRITICAL - Packet loss = 100% [03:19:04] PROBLEM - Host dbproxy1021 is DOWN: PING CRITICAL - Packet loss = 100% [03:19:04] PROBLEM - Host dbproxy1019 is DOWN: PING CRITICAL - Packet loss = 100% [03:19:04] PROBLEM - Host mwdebug1001 is DOWN: PING CRITICAL - Packet loss = 100% [03:19:06] PROBLEM - Host dbproxy1020 is DOWN: PING CRITICAL - Packet loss = 100% [03:19:06] PROBLEM - Host matomo1002 is DOWN: PING CRITICAL - Packet loss = 100% [03:19:06] PROBLEM - Host logstash1025 is DOWN: PING CRITICAL - Packet loss = 100% [03:19:10] PROBLEM - Host ml-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [03:19:10] PROBLEM - Host kubetcd1004 is DOWN: PING CRITICAL - Packet loss = 100% [03:19:15] wuh oh [03:19:16] PROBLEM - Host aqs1005 is DOWN: PING CRITICAL - Packet loss = 100% [03:19:20] PROBLEM - Host an-tool1007 is DOWN: PING CRITICAL - Packet loss = 100% [03:19:20] PROBLEM - Host an-tool1005 is DOWN: PING CRITICAL - Packet loss = 100% [03:19:20] PROBLEM - Host an-conf1002 is DOWN: PING CRITICAL - Packet loss = 100% [03:19:21] PROBLEM - Host db1181 #page is DOWN: PING CRITICAL - Packet loss = 100% [03:19:25] Blame Tim [03:19:28] PROBLEM - Host dbproxy1018 is DOWN: PING CRITICAL - Packet loss = 100% [03:19:36] (joke) 
[03:19:53] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2060.codfw.wmnet with OS bullseye [03:19:57] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host elastic2060.codfw.wmnet with OS bullseye [03:19:59] pretty sure I'm gonna blame a rack switch but let's see [03:20:02] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 321 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:20:06] PROBLEM - Host aqs1013 is DOWN: PING CRITICAL - Packet loss = 100% [03:20:56] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [03:21:14] PROBLEM - MariaDB Replica IO: s7 on db1171 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1181.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1181.eqiad.wmnet (113 No route to host) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:22:08] PROBLEM - MariaDB Replica IO: s7 #page on db1127 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1181.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db1181.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:22:09] PROBLEM - Host ps1-e3-eqiad is DOWN: PING CRITICAL - 
Packet loss = 100% [03:22:09] PROBLEM - Host puppetmaster1002 is DOWN: PING CRITICAL - Packet loss = 100% [03:22:14] PROBLEM - aqs endpoints health on aqs1011 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) timed out before a response was received: /analytics.wikimedia.org/v1/unique-devices/{project}/{acc [03:22:14] }/{granularity}/{start}/{end} (Get unique devices) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/top/{referer}/{media_type}/{year}/{month}/{day} (Get top files by mediarequests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [03:22:18] who took down meta? [03:22:21] RECOVERY - Host mwdebug1001 is UP: PING OK - Packet loss = 0%, RTA = 1.52 ms [03:22:21] RECOVERY - Host ps1-e3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.39 ms [03:22:23] RECOVERY - Host db1181 #page is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [03:22:23] RECOVERY - Host doh1001 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [03:22:24] hm, it's more than one rack [03:22:25] RECOVERY - Host db1146 #page is UP: PING OK - Packet loss = 0%, RTA = 2.90 ms [03:22:26] RECOVERY - Host pc1013 #page is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [03:22:26] RECOVERY - Host ganeti1010 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [03:22:26] RECOVERY - Host an-tool1005 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [03:22:31] RECOVERY - Host wdqs1013 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [03:22:31] RECOVERY - Host db1145 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [03:22:31] RECOVERY - Host wtp1038 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [03:22:33] RECOVERY - Host dbproxy1021 is UP: PING OK - Packet loss = 0%, RTA = 0.41 
ms [03:22:33] RECOVERY - Host aqs1005 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [03:22:33] RECOVERY - Host wtp1039 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [03:22:33] RECOVERY - Host dbproxy1018 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [03:22:35] RECOVERY - Host aqs1013 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [03:22:35] RECOVERY - Host wcqs1003 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [03:22:35] RECOVERY - Host dbproxy1019 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [03:22:37] RECOVERY - Host wtp1037 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [03:22:38] definitely not out of the woods yet, still looking [03:22:39] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 109 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:22:41] RECOVERY - Host kubemaster1002 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [03:22:41] RECOVERY - Host puppetmaster1002 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [03:22:41] RECOVERY - Host kubernetes1012 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [03:22:41] RECOVERY - Host an-conf1002 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [03:22:43] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wikireplicas-a-s4_3314: Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-b-s6_3316: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-a-s1_3311: Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-b-s5_3315: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s [03:22:43] Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-a-s2_3312: Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-b-s2_3312: Servers dbproxy1019.eqiad.wmnet are marked 
down but pooled: wikireplicas-b-s3_3313: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: kibana7_443: Servers logstash1025.eqiad.wmnet are marked down but pooled: wikireplicas-a-s8_3318: Servers dbproxy1018.eqiad.wmnet [03:22:43] ed down but pooled: wikireplicas-a-s5_3315: Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-b-s4_3314: Servers dbproxy1019.eqiad.wmnet are marked down but poole https://wikitech.wikimedia.org/wiki/PyBal [03:22:43] RECOVERY - Host ganeti1024 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [03:22:46] RECOVERY - Host es1022 #page is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [03:22:49] RECOVERY - Host gitlab-runner1003 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [03:22:49] RECOVERY - Host db1120 #page is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [03:22:50] RECOVERY - Host db1169 #page is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [03:22:51] RECOVERY - Host kubetcd1004 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [03:22:54] RECOVERY - Host db1168 #page is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [03:22:54] RECOVERY - Host logstash1025 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [03:23:01] RECOVERY - Host ml-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [03:23:03] RECOVERY - Host dbproxy1020 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [03:23:10] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [03:23:13] RECOVERY - Host matomo1002 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [03:23:29] RECOVERY - Host an-tool1007 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [03:23:35] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 508 gt 100 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:23:51] (ProbeDown) firing: Service kibana7:443 has failed probes (http_kibana7_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:23:56] (ProbeDown) firing: Service kibana7:443 has failed probes (http_kibana7_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:24:09] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash1025.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:24:19] RECOVERY - MariaDB Replica IO: s7 on db1171 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:24:35] PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - CRITICAL - wikireplicas-b-s6_3316: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-a-s1_3311: Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-b-s5_3315: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s8_3318: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-a-s [03:24:35] Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-b-s2_3312: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s3_3313: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-a-s8_3318: Servers dbproxy1018.eqiad.wmnet are marked down but pooled: 
wikireplicas-a-s5_3315: Servers dbproxy1018.eqiad.wmnet are marked down but pooled: wikireplicas-b-s1_3311: Servers dbproxy1019.eq [03:24:35] t are marked down but pooled: wikireplicas-b-s4_3314: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-a-s3_3313: Servers dbproxy1018.eqiad.wmnet are marked down https://wikitech.wikimedia.org/wiki/PyBal [03:24:51] RECOVERY - aqs endpoints health on aqs1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [03:24:52] RECOVERY - MariaDB Replica IO: s7 #page on db1127 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:25:11] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 13 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:25:40] (JobUnavailable) firing: Reduced availability for job wikidough in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:25:41] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: (C)100 gt (W)50 gt 49 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:25:51] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [03:26:06] (KubernetesCalicoDown) 
firing: kubernetes1012.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [03:26:13] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 41 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:26:37] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [03:26:41] (JobUnavailable) firing: Reduced availability for job wikidough in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:26:45] (KubernetesCalicoDown) firing: kubernetes1012.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [03:26:47] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:26:54] "OK, that worked, I tested this..." <- Thanks. 
[03:27:13] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [03:27:41] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:28:01] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [03:28:21] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1004 is OK: OK: Less than 20.00% above the threshold [300.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [03:28:33] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 16 down 8: https://wikitech.wikimedia.org/wiki/HAProxy [03:28:33] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [03:29:19] PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 16 down 8: https://wikitech.wikimedia.org/wiki/HAProxy [03:29:41] RECOVERY - PyBal backends health check on lvs1018 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:30:09] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:30:17] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_purge_parsercache_pc3.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:30:28] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation 
(exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135 [03:33:17] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: The flink session cluster rdf-streaming-updater in eqiad (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [03:36:47] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [03:37:27] !log rzl@dbproxy1018:~$ sudo systemctl reload haproxy [03:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:39:08] (ProbeDown) firing: (3) Service phab1001:443 has failed probes (http_phabricator_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:39:09] RECOVERY - dump of es5 in codfw on backupmon1001 is OK: Last dump for es5 at codfw (es2025) taken on 2022-07-19 00:00:01 (3261 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [03:39:12] (ProbeDown) resolved: Service kibana7:443 has failed probes (http_kibana7_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:39:18] (03PS2) 10KartikMistry: Enable ContentTranslation out of Beta for sswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815251 (https://phabricator.wikimedia.org/T309384) [03:39:21] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 16 
down 0: https://wikitech.wikimedia.org/wiki/HAProxy [03:39:23] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [03:43:59] (KubernetesRsyslogDown) firing: (2) rsyslog on kubernetes1010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:44:04] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:44:14] (JobUnavailable) resolved: Reduced availability for job wikidough in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:44:19] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [03:44:34] (KubernetesCalicoDown) resolved: kubernetes1012.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [03:46:55] (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: The flink session cluster rdf-streaming-updater in eqiad (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [03:47:57] legoktm: Am I wrong to think that wikimediastatus.net isn't very 
informative on the subject of phab being down despite the topic's claims? It currently just says "all systems operational". [03:48:01] RECOVERY - Check systemd state on elastic2060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:48:27] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [03:48:38] !log rzl@cumin2002:~$ sudo cumin dbproxy[1019,1020,1021].eqiad.wmnet 'systemctl reload haproxy' [03:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:48:41] Kemayo: right, the main status page is for wikis, not supporting services [03:49:09] RECOVERY - haproxy failover on dbproxy1019 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [03:49:19] phab restored 👍 [03:49:27] Which is fine -- it just feels weird to reference it in the topic as such. :D [03:50:52] yeah, I probably should've thrown in a few more words there, like "for wikis see..." 
[03:51:27] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [03:51:46] (ProbeDown) resolved: (3) Service phab1001:443 has failed probes (http_phabricator_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:53:56] (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [03:54:30] logstash: I know, buddy, I know <3 [03:55:33] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [03:56:06] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [03:56:19] (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [03:56:50] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [04:00:39] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:05:23] PROBLEM - SSH on mw1321.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:08:13] !log rzl@kubemaster1002:~$ sudo systemctl restart kube-apiserver [04:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:10:07] !log rzl@kubemaster1001:~$ sudo systemctl restart kube-apiserver [04:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:12:01] PROBLEM - Host analytics1068 is DOWN: PING CRITICAL - Packet loss = 100% [04:14:21] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [04:21:09] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:42:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 8 hosts with reason: Maintenance [04:43:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 8 hosts with reason: Maintenance [04:43:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [04:43:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [04:47:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1101.eqiad.wmnet with reason: Maintenance [04:47:24] !log marostegui@cumin1001 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1101.eqiad.wmnet with reason: Maintenance [04:47:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T312990)', diff saved to https://phabricator.wikimedia.org/P31471 and previous config saved to /var/cache/conftool/dbconfig/20220720-044729-marostegui.json [04:47:33] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [04:50:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T312990)', diff saved to https://phabricator.wikimedia.org/P31472 and previous config saved to /var/cache/conftool/dbconfig/20220720-045004-marostegui.json [04:50:26] (03PS1) 10Marostegui: instances.yaml: Add db2168 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/815427 (https://phabricator.wikimedia.org/T311493) [04:54:14] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2168 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/815427 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [04:57:55] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135 [04:57:59] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [04:59:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2168 to dbctl in s7 and s8 T311493', diff saved to https://phabricator.wikimedia.org/P31473 and previous config saved to /var/cache/conftool/dbconfig/20220720-045918-marostegui.json [04:59:22] T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493 [05:00:28] (03PS1) 10Marostegui: db2168: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/815428 (https://phabricator.wikimedia.org/T311493) [05:05:10] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P31474 and previous config saved to /var/cache/conftool/dbconfig/20220720-050509-marostegui.json [05:06:35] RECOVERY - SSH on mw1321.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:09:49] (03CR) 10Marostegui: [C: 03+2] db2168: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/815428 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [05:13:25] (03PS1) 10Marostegui: site.pp: Remove insetup from db2167,db2168 [puppet] - 10https://gerrit.wikimedia.org/r/815430 (https://phabricator.wikimedia.org/T311493) [05:14:46] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove insetup from db2167,db2168 [puppet] - 10https://gerrit.wikimedia.org/r/815430 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [05:20:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P31475 and previous config saved to /var/cache/conftool/dbconfig/20220720-052014-marostegui.json [05:26:34] !log Stop mysql on db2087 (s6 and s7) to clone db2169 T311493 [05:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:26:38] T311493: Productionize db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T311493 [05:27:41] (03CR) 10Krinkle: [C: 03+1] Switch testwiki to multi-DC active/active mode [puppet] - 10https://gerrit.wikimedia.org/r/815403 (https://phabricator.wikimedia.org/T279664) (owner: 10Tim Starling) [05:28:03] (03PS1) 10Marostegui: db2087: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/815516 (https://phabricator.wikimedia.org/T311493) [05:29:06] (03CR) 10Marostegui: [C: 03+2] db2087: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/815516 
(https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [05:35:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T312990)', diff saved to https://phabricator.wikimedia.org/P31478 and previous config saved to /var/cache/conftool/dbconfig/20220720-053520-marostegui.json [05:35:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1171.eqiad.wmnet with reason: Maintenance [05:35:24] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [05:35:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1171.eqiad.wmnet with reason: Maintenance [05:36:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1170.eqiad.wmnet with reason: Maintenance [05:36:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1170.eqiad.wmnet with reason: Maintenance [05:36:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T312990)', diff saved to https://phabricator.wikimedia.org/P31479 and previous config saved to /var/cache/conftool/dbconfig/20220720-053620-marostegui.json [05:37:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T312990)', diff saved to https://phabricator.wikimedia.org/P31480 and previous config saved to /var/cache/conftool/dbconfig/20220720-053751-marostegui.json [05:40:39] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (25) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudservices1003, cloudservices1004, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe20 [05:40:39] 
://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [05:44:43] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:52:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P31481 and previous config saved to /var/cache/conftool/dbconfig/20220720-055256-marostegui.json [05:56:25] PROBLEM - Check systemd state on elastic2034 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:03:09] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (25) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudservices1003, cloudservices1004, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe20 [06:03:09] ://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [06:08:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P31482 and previous config saved to /var/cache/conftool/dbconfig/20220720-060802-marostegui.json [06:21:47] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:23:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T312990)', diff saved to https://phabricator.wikimedia.org/P31483 and previous config saved to /var/cache/conftool/dbconfig/20220720-062307-marostegui.json [06:23:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 
5:00:00 on db1174.eqiad.wmnet with reason: Maintenance [06:23:13] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [06:23:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1174.eqiad.wmnet with reason: Maintenance [06:23:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T312990)', diff saved to https://phabricator.wikimedia.org/P31484 and previous config saved to /var/cache/conftool/dbconfig/20220720-062327-marostegui.json [06:25:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T312990)', diff saved to https://phabricator.wikimedia.org/P31485 and previous config saved to /var/cache/conftool/dbconfig/20220720-062539-marostegui.json [06:28:14] (PS1) PleaseStand: SecurePoll: Adding files for 2022 vote [extensions/SecurePoll] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/815411 (https://phabricator.wikimedia.org/T309753) [06:29:30] (PS1) PleaseStand: populateEditCount: Call waitForReplication() every 500 users [extensions/SecurePoll] (wmf/1.39.0-wmf.19) - https://gerrit.wikimedia.org/r/815412 (https://phabricator.wikimedia.org/T309753) [06:30:18] (PS1) PleaseStand: populateEditCount: Call waitForReplication() every 500 users [extensions/SecurePoll] (wmf/1.39.0-wmf.21) - https://gerrit.wikimedia.org/r/815413 (https://phabricator.wikimedia.org/T309753) [06:40:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P31486 and previous config saved to /var/cache/conftool/dbconfig/20220720-064044-marostegui.json [06:41:18] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2020.codfw.wmnet with reason: Remove node for eventual reimage, T311686 [06:41:21] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 
[06:41:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2020.codfw.wmnet with reason: Remove node for eventual reimage, T311686 [06:43:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2029.codfw.wmnet with OS bullseye [06:43:40] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2029.codfw.wmnet with OS bullseye [06:55:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P31487 and previous config saved to /var/cache/conftool/dbconfig/20220720-065549-marostegui.json [06:57:48] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2029.codfw.wmnet with reason: host reimage [07:00:05] Amir1 and Urbanecm: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220720T0700). Please do the needful. [07:00:05] Sohom_Datta, kart_, PleaseStand, PleaseStand, and PleaseStand: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:01:38] * kart_ is here. Sorry for delay [07:02:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2029.codfw.wmnet with reason: host reimage [07:03:14] Amir1: hi [07:03:20] I'm here :) [07:03:48] cool. Amir1 urbanecm Are you doing deployments? [07:05:30] OK. I can quickly deploy my change first while we are waiting for Amir1 / urbanecm [07:06:03] (03CR) 10Filippo Giunchedi: [C: 03+1] "Nicely done! 
LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/814848 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [07:06:58] (03CR) 10KartikMistry: [C: 03+2] Enable ContentTranslation out of Beta for sswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815251 (https://phabricator.wikimedia.org/T309384) (owner: 10KartikMistry) [07:07:55] (03Merged) 10jenkins-bot: Enable ContentTranslation out of Beta for sswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815251 (https://phabricator.wikimedia.org/T309384) (owner: 10KartikMistry) [07:09:12] o/ I can deploy today [07:09:45] taavi: I'm deploying my change, will let you know once done. [07:09:55] sure [07:10:54] (03PS1) 10Marostegui: mariadb: Productionize db2169 [puppet] - 10https://gerrit.wikimedia.org/r/815677 (https://phabricator.wikimedia.org/T311493) [07:10:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T312990)', diff saved to https://phabricator.wikimedia.org/P31488 and previous config saved to /var/cache/conftool/dbconfig/20220720-071054-marostegui.json [07:10:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1136.eqiad.wmnet with reason: Maintenance [07:10:59] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [07:11:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1136.eqiad.wmnet with reason: Maintenance [07:11:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T312990)', diff saved to https://phabricator.wikimedia.org/P31489 and previous config saved to /var/cache/conftool/dbconfig/20220720-071114-marostegui.json [07:12:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:14:53] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:815251|Enable ContentTranslation out of Beta for 
sswiki (T309384)]] (duration: 03m 24s) [07:14:57] T309384: Enable Content and Section translation on wikipedias with new MT support from Flores - https://phabricator.wikimedia.org/T309384 [07:15:40] taavi: I'm done. [07:16:19] (PS5) Majavah: Enable edit-in-sequence on Beta Wikisource for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/810054 (https://phabricator.wikimedia.org/T308098) (owner: Sohom Datta) [07:16:41] (CR) Majavah: [C: +2] Enable edit-in-sequence on Beta Wikisource for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/810054 (https://phabricator.wikimedia.org/T308098) (owner: Sohom Datta) [07:17:10] SRE, Infrastructure-Foundations, Traffic, netops, User-jbond: fetch_external_clouds_vendors_nets.py fails to update DigitalOcean network ranges - https://phabricator.wikimedia.org/T313206 (Vgutierrez) Open→Resolved a:Vgutierrez DigitalOcean restored the CSV and it's now working as... [07:17:16] SRE, Infrastructure-Foundations, Traffic-Icebox, netops, and 2 others: varnish filtering: should we automatically update public_cloud_nets - https://phabricator.wikimedia.org/T270391 (Vgutierrez) [07:17:36] (Merged) jenkins-bot: Enable edit-in-sequence on Beta Wikisource for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/810054 (https://phabricator.wikimedia.org/T308098) (owner: Sohom Datta) [07:17:45] (CR) JMeybohm: [C: +1] role::beta::docker_services: prune docker images [puppet] - https://gerrit.wikimedia.org/r/815335 (https://phabricator.wikimedia.org/T313334) (owner: Ori) [07:18:25] Sohom_Datta: merged your patch, it'll be automatically deployed to the beta cluster in the next 30 mins or so, ping me if it doesn't [07:18:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2029.codfw.wmnet with OS bullseye [07:18:37] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - 
https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2029.codfw.wmnet with OS bullseye completed: - ganeti2029 (**PASS**) - Downtimed on... [07:18:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:18:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:19:05] 10SRE-swift-storage: Uncaught TimeoutError from inactivedc_request caused swift-proxy to wedge itself - https://phabricator.wikimedia.org/T313102 (10fgiunchedi) >>! In T313102#8088079, @MatthewVernon wrote: > Are there other teams you think we should talk to before turning this off, then? Indeed, I know @hnowla... [07:19:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T312990)', diff saved to https://phabricator.wikimedia.org/P31490 and previous config saved to /var/cache/conftool/dbconfig/20220720-071927-marostegui.json [07:19:28] (03CR) 10Majavah: [C: 03+2] SecurePoll: Adding files for 2022 vote [extensions/SecurePoll] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/815411 (https://phabricator.wikimedia.org/T309753) (owner: 10PleaseStand) [07:19:30] (03CR) 10Majavah: [C: 03+2] populateEditCount: Call waitForReplication() every 500 users [extensions/SecurePoll] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/815412 (https://phabricator.wikimedia.org/T309753) (owner: 10PleaseStand) [07:19:31] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [07:19:32] (03CR) 10Majavah: [C: 03+2] populateEditCount: Call waitForReplication() every 500 users [extensions/SecurePoll] (wmf/1.39.0-wmf.21) - 10https://gerrit.wikimedia.org/r/815413 (https://phabricator.wikimedia.org/T309753) (owner: 10PleaseStand) [07:19:54] PleaseStand: I'm guessing your patches can't really be tested? 
[07:20:30] PROBLEM - k8s requests count to the API on ml-serve-ctrl2002 is CRITICAL: 101.9 ge 100 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1 [07:20:37] Thanks a bunch, will let you know :) [07:20:49] 10SRE, 10SRE-Access-Requests: Grant Access to analytics-privatedata-users for Segun Oworu - https://phabricator.wikimedia.org/T313213 (10Vgutierrez) >>! In T313213#8089900, @AAlikhan wrote: > I'm approving this request for @soworu. Let me know if there's anything beyond this comment that I need to do to suppor... [07:21:10] taavi: I don't have production shell access, and probably don't have beta cluster shell access either [07:21:49] (03Merged) 10jenkins-bot: SecurePoll: Adding files for 2022 vote [extensions/SecurePoll] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/815411 (https://phabricator.wikimedia.org/T309753) (owner: 10PleaseStand) [07:21:51] (03Merged) 10jenkins-bot: populateEditCount: Call waitForReplication() every 500 users [extensions/SecurePoll] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/815412 (https://phabricator.wikimedia.org/T309753) (owner: 10PleaseStand) [07:21:58] (03Merged) 10jenkins-bot: populateEditCount: Call waitForReplication() every 500 users [extensions/SecurePoll] (wmf/1.39.0-wmf.21) - 10https://gerrit.wikimedia.org/r/815413 (https://phabricator.wikimedia.org/T309753) (owner: 10PleaseStand) [07:22:09] I know, I'm asking if there's anything that needs to be done to your patches before I sync them to the prod cluster [07:22:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:23:11] taavi: Should be OK, it's only a maintenance script that would be run manually, probably by foks [07:23:48] ok, thanks [07:23:53] yup that is correct [07:24:57] (03CR) 10David Caro: wmcs: don't page for most checks (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/813267 (owner: 10David Caro) [07:25:58] (03CR) 
10Filippo Giunchedi: [C: 03+2] prometheus: enable x509 CN validation in blackbox (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/815304 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [07:26:03] (03PS2) 10Filippo Giunchedi: prometheus: enable x509 CN validation in blackbox [puppet] - 10https://gerrit.wikimedia.org/r/815304 (https://phabricator.wikimedia.org/T305847) [07:26:13] (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:26:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubernetes1010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:26:58] !log taavi@deploy1002 Synchronized php-1.39.0-wmf.19/extensions/SecurePoll/cli/wm-scripts/bv2022/: T309753 backports (duration: 02m 57s) [07:27:02] T309753: Create SecurePoll voter list for 2022 board vote - https://phabricator.wikimedia.org/T309753 [07:27:29] (03PS1) 10Marostegui: wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/815680 (https://phabricator.wikimedia.org/T313382) [07:27:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:28:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:28:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:29:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:30:00] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10Marostegui) Critical DB infra there: - dbproxy1020 (m3 current proxy): needs failover. 
- pc1013 active pc3 master: needs failover - db1181 s7 master: needs failover T313383... [07:30:22] !log kubernetes1010.eqiad.wmnet,kubernetes1020.eqiad.wmnet 'systemctl restart rsyslog' [07:30:24] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10Marostegui) [07:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:40] !log taavi@deploy1002 Synchronized php-1.39.0-wmf.21/extensions/SecurePoll/cli/wm-scripts/bv2022/populateEditCount.php: T309753 backports (duration: 02m 54s) [07:30:53] PleaseStand: ok, that should be everything synced [07:30:59] anyone have anything else to deploy? [07:31:28] !log ml-serve1002.eqiad.wmnet,ml-serve1004.eqiad.wmnet 'systemctl restart rsyslog' [07:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:45] 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10ayounsi) p:05Triage→03High [07:32:44] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2169 [puppet] - 10https://gerrit.wikimedia.org/r/815677 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [07:33:27] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10Marostegui) [07:33:37] (03PS5) 10JMeybohm: k8s/reboot-nodes: Error if nodes are cordoned [cookbooks] - 10https://gerrit.wikimedia.org/r/812325 (https://phabricator.wikimedia.org/T260661) [07:34:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P31491 and previous config saved to /var/cache/conftool/dbconfig/20220720-073432-marostegui.json [07:34:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:35:23] 10SRE, 10Image-Suggestions: 
Envoy cannot connect to image-suggestion service - https://phabricator.wikimedia.org/T312225 (10JMeybohm) 05Open→03Resolved [07:35:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:35:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:35:58] (KubernetesRsyslogDown) resolved: (2) rsyslog on ml-serve1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:35:58] (KubernetesRsyslogDown) resolved: (2) rsyslog on kubernetes1010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:36:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:39:23] (03PS1) 10Majavah: P:openstack::cinder: use new rabbitmq_hosts hiera var [puppet] - 10https://gerrit.wikimedia.org/r/815681 [07:41:11] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: use protocol in blackbox target files [puppet] - 10https://gerrit.wikimedia.org/r/815305 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [07:41:18] (03PS2) 10Filippo Giunchedi: prometheus: use protocol in blackbox target files [puppet] - 10https://gerrit.wikimedia.org/r/815305 (https://phabricator.wikimedia.org/T305847) [07:41:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet [07:41:35] (03PS2) 10Majavah: P:openstack::cinder: use new rabbitmq_hosts hiera var [puppet] - 10https://gerrit.wikimedia.org/r/815681 [07:42:30] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36315/console" [puppet] - 10https://gerrit.wikimedia.org/r/815681 (owner: 10Majavah) [07:42:33] (03CR) 10Filippo Giunchedi: [V: 03+2] prometheus: use protocol in 
blackbox target files [puppet] - 10https://gerrit.wikimedia.org/r/815305 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [07:44:06] 10SRE, 10Wikimedia-Mailing-lists: Volunteer account erroneously linked with official email id - https://phabricator.wikimedia.org/T313321 (10Vgutierrez) maybe @Ladsgroup would be a better fit to help here, but meanwhile, could you provide some details like which mailing list are you referring to? Thanks [07:46:00] (03PS1) 10Majavah: P:openstack::designate: use new rabbitmq_hosts hiera var [puppet] - 10https://gerrit.wikimedia.org/r/815683 [07:47:14] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36316/console" [puppet] - 10https://gerrit.wikimedia.org/r/815683 (owner: 10Majavah) [07:47:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet [07:47:53] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10ayounsi) This didn't get caught by monitoring. We have a LibreNMS alert that triggers when any "emergency" log is sent by a device, but loo... [07:49:21] 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10Peachey88) [07:49:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P31492 and previous config saved to /var/cache/conftool/dbconfig/20220720-074937-marostegui.json [07:54:06] (03CR) 10MVernon: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/815680 (https://phabricator.wikimedia.org/T313382) (owner: 10Marostegui) [07:59:13] (03CR) 10Filippo Giunchedi: "Thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/815306 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [07:59:24] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add blackbox TCP check [puppet] - 10https://gerrit.wikimedia.org/r/815306 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [07:59:32] (03PS2) 10Filippo Giunchedi: prometheus: add blackbox TCP check [puppet] - 10https://gerrit.wikimedia.org/r/815306 (https://phabricator.wikimedia.org/T305847) [08:00:05] jeena and jnuche: Time to snap out of that daydream and deploy MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220720T0800). [08:02:14] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:04:16] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:04:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T312990)', diff saved to https://phabricator.wikimedia.org/P31493 and previous config saved to /var/cache/conftool/dbconfig/20220720-080442-marostegui.json [08:04:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1158.eqiad.wmnet with reason: Maintenance [08:04:47] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [08:04:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1158.eqiad.wmnet with reason: Maintenance [08:05:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on 
clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:05:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:05:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T312990)', diff saved to https://phabricator.wikimedia.org/P31494 and previous config saved to /var/cache/conftool/dbconfig/20220720-080509-marostegui.json [08:07:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T312990)', diff saved to https://phabricator.wikimedia.org/P31495 and previous config saved to /var/cache/conftool/dbconfig/20220720-080721-marostegui.json [08:08:14] PROBLEM - SSH on mw1321.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:08:16] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:09:37] (03CR) 10Volans: sre.hardware.firmware-upgrade: create new cookbook for updating idrac and bios (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 (owner: 10Jbond) [08:10:01] (03PS1) 10Filippo Giunchedi: prometheus: adjust blackbox check params/types [puppet] - 10https://gerrit.wikimedia.org/r/815685 (https://phabricator.wikimedia.org/T305847) [08:11:06] (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/811730 (owner: 10Ayounsi) [08:11:24] 10SRE, 10Wikimedia-Mailing-lists: Volunteer account erroneously linked with official email id - https://phabricator.wikimedia.org/T313321 (10Aklapper) 05Open→03Stalled Also, what exactly is an "id"? What does https://lists.wikimedia.org/user-profile/ say? 
[08:12:25] (03CR) 10CI reject: [V: 04-1] prometheus: adjust blackbox check params/types [puppet] - 10https://gerrit.wikimedia.org/r/815685 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [08:12:41] (03PS2) 10Filippo Giunchedi: prometheus: adjust blackbox check params/types [puppet] - 10https://gerrit.wikimedia.org/r/815685 (https://phabricator.wikimedia.org/T305847) [08:12:43] (03PS2) 10Filippo Giunchedi: syslog: probe TLS endpoint with blackbox [puppet] - 10https://gerrit.wikimedia.org/r/815307 (https://phabricator.wikimedia.org/T305847) [08:12:45] (03PS6) 10Filippo Giunchedi: phabricator: switch to prometheus-only network probes/checks [puppet] - 10https://gerrit.wikimedia.org/r/812846 (https://phabricator.wikimedia.org/T305847) [08:12:47] (03PS14) 10Filippo Giunchedi: WIP irc check via blackbox [puppet] - 10https://gerrit.wikimedia.org/r/805815 [08:14:07] 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10ayounsi) [08:14:13] 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10ayounsi) [08:14:17] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10ayounsi) [08:14:54] (03CR) 10Volans: [C: 03+1] "LGTM, one nit inline" [software/homer] - 10https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710) (owner: 10Ayounsi) [08:14:54] !log apt-get clean on archiva1002 to free some space [08:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q4: (Need By: TBD) rack/setup/install mw14[57-98] - https://phabricator.wikimedia.org/T306121 (10ayounsi) [08:16:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Decommission mw13[07-48] - 
https://phabricator.wikimedia.org/T306162 (10ayounsi) [08:19:31] (03CR) 10Filippo Giunchedi: phabricator: switch to prometheus-only network probes/checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812846 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [08:22:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P31496 and previous config saved to /var/cache/conftool/dbconfig/20220720-082226-marostegui.json [08:23:01] 10SRE, 10Data-Engineering, 10Discovery: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (10elukey) [08:34:33] (03PS7) 10Ayounsi: Netbox _get_circuits: add patch panel support [software/homer] - 10https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710) [08:34:49] (03CR) 10Ayounsi: Netbox _get_circuits: add patch panel support (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710) (owner: 10Ayounsi) [08:37:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P31497 and previous config saved to /var/cache/conftool/dbconfig/20220720-083731-marostegui.json [08:43:38] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: cr2-eqiad:FPC3 partial failure (PIC2/3) - https://phabricator.wikimedia.org/T312745 (10ayounsi) 05Resolved→03Open Since the replacement errors rate on one of the interfaces went though the roof: https://librenms.wikimedia.org/graphs/to=1658306... 
[08:48:24] (03CR) 10Jbond: prometheus: enable x509 CN validation in blackbox (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/815304 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [08:50:45] (03CR) 10Jbond: prometheus: enable x509 CN validation in blackbox (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815304 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [08:52:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T312990)', diff saved to https://phabricator.wikimedia.org/P31498 and previous config saved to /var/cache/conftool/dbconfig/20220720-085236-marostegui.json [08:52:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1098.eqiad.wmnet with reason: Maintenance [08:52:43] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [08:52:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1098.eqiad.wmnet with reason: Maintenance [08:52:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T312990)', diff saved to https://phabricator.wikimedia.org/P31499 and previous config saved to /var/cache/conftool/dbconfig/20220720-085256-marostegui.json [09:00:28] (03CR) 10Ayounsi: provision cookbook: configure switches using cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/811730 (owner: 10Ayounsi) [09:03:21] (03CR) 10Jbond: [C: 03+1] gerrit: add gerrit2002 to puppetized known_hosts file [puppet] - 10https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [09:03:38] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:06:17] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/807969 
(owner: 10Jbond) [09:07:32] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/807968 (owner: 10Jbond) [09:08:09] (03CR) 10Jbond: redfish: add a fqdn getter property and __str__ method (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/807968 (owner: 10Jbond) [09:08:18] (03CR) 10Jbond: [C: 03+2] redfish: add a fqdn getter property and __str__ method [software/spicerack] - 10https://gerrit.wikimedia.org/r/807968 (owner: 10Jbond) [09:08:21] (03PS8) 10Jbond: redfish: add a fqdn getter property and __str__ method [software/spicerack] - 10https://gerrit.wikimedia.org/r/807968 [09:09:05] (03CR) 10Volans: [C: 03+1] "LGTM, nit inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/807970 (owner: 10Jbond) [09:09:26] (03PS6) 10Jbond: redfish: add a generation property [software/spicerack] - 10https://gerrit.wikimedia.org/r/807969 [09:10:05] (03PS1) 10Elukey: kserve: upgrade to upstream release 0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/815691 (https://phabricator.wikimedia.org/T311982) [09:10:52] (03CR) 10CI reject: [V: 04-1] kserve: upgrade to upstream release 0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/815691 (https://phabricator.wikimedia.org/T311982) (owner: 10Elukey) [09:13:59] (03PS5) 10Jbond: redfish: Add property for the HttpPushURI [software/spicerack] - 10https://gerrit.wikimedia.org/r/807970 [09:14:01] (03PS5) 10Jbond: redfish: add wait for reboot function [software/spicerack] - 10https://gerrit.wikimedia.org/r/807971 [09:14:29] (03PS6) 10Jbond: redfish: Add property for the HttpPushURI [software/spicerack] - 10https://gerrit.wikimedia.org/r/807970 [09:14:33] (03PS6) 10Jbond: redfish: add wait for reboot function [software/spicerack] - 10https://gerrit.wikimedia.org/r/807971 [09:14:51] (03PS2) 10Elukey: kserve: upgrade to upstream release 0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/815691 (https://phabricator.wikimedia.org/T311982) [09:15:35] 10SRE, 
10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10ayounsi) Opened high severity JTAC case 2022-0720-513915. In the meantime we need to discuss if we want to preemptively replace FPC5 with a... [09:15:58] (03CR) 10CI reject: [V: 04-1] kserve: upgrade to upstream release 0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/815691 (https://phabricator.wikimedia.org/T311982) (owner: 10Elukey) [09:17:53] (03CR) 10Jbond: remote: add an __iter__ to RemoteHosts (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/803243 (owner: 10Jbond) [09:19:27] (03CR) 10Volans: "replies inline, LGTM otherwise" [software/spicerack] - 10https://gerrit.wikimedia.org/r/807971 (owner: 10Jbond) [09:21:43] (03CR) 10Jbond: redfish: Add property for the HttpPushURI (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/807970 (owner: 10Jbond) [09:21:47] (03CR) 10Jbond: redfish: add a generation property (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/807969 (owner: 10Jbond) [09:22:13] (03PS2) 10Jbond: remote: add an __iter__ to RemoteHosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/803243 [09:22:47] (03PS2) 10Volans: config: fix type hints for YAML callables [software/homer] - 10https://gerrit.wikimedia.org/r/814839 (owner: 10Ayounsi) [09:26:29] (03PS7) 10Jbond: redfish: add wait for reboot function [software/spicerack] - 10https://gerrit.wikimedia.org/r/807971 [09:29:01] (03CR) 10Jbond: redfish: add wait for reboot function (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/807971 (owner: 10Jbond) [09:29:21] (03CR) 10Jbond: [C: 03+2] redfish: add a generation property [software/spicerack] - 10https://gerrit.wikimedia.org/r/807969 (owner: 10Jbond) [09:29:25] (03CR) 10Jbond: [C: 03+2] redfish: Add property for the HttpPushURI [software/spicerack] - 10https://gerrit.wikimedia.org/r/807970 (owner: 
10Jbond) [09:30:06] (03CR) 10Ayounsi: [C: 03+1] config: fix type hints for YAML callables [software/homer] - 10https://gerrit.wikimedia.org/r/814839 (owner: 10Ayounsi) [09:31:08] jbond: XioNoX: hi, I am wondering whether we should move wikibugs notifications for homer/spicerack to another channel? :] [09:31:46] I /ignore wikibugs so dunno what you're talking about :) [09:31:51] ahah [09:31:54] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/807970 (owner: 10Jbond) [09:32:30] (/cc volans) [09:33:58] (03PS7) 10Jbond: redfish: add a generation property [software/spicerack] - 10https://gerrit.wikimedia.org/r/807969 [09:34:04] (03CR) 10CI reject: [V: 04-1] redfish: Add property for the HttpPushURI [software/spicerack] - 10https://gerrit.wikimedia.org/r/807970 (owner: 10Jbond) [09:34:10] (03PS7) 10Jbond: redfish: Add property for the HttpPushURI [software/spicerack] - 10https://gerrit.wikimedia.org/r/807970 [09:34:15] (03PS8) 10Jbond: redfish: add wait for reboot function [software/spicerack] - 10https://gerrit.wikimedia.org/r/807971 [09:35:01] (03CR) 10Jbond: [C: 03+2] "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/807968 (owner: 10Jbond) [09:35:17] (03CR) 10Jbond: [C: 03+2] "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/807969 (owner: 10Jbond) [09:37:48] (03Merged) 10jenkins-bot: redfish: add a fqdn getter property and __str__ method [software/spicerack] - 10https://gerrit.wikimedia.org/r/807968 (owner: 10Jbond) [09:41:23] https://gerrit.wikimedia.org/r/plugins/gitiles/labs/tools/wikibugs2/+/refs/heads/master/gerrit-channels.yaml#162 has a catch all for `operations/` repos [09:41:45] (03PS3) 10Volans: config: fix type hints for YAML callables [software/homer] - 10https://gerrit.wikimedia.org/r/814839 (owner: 10Ayounsi) [09:44:54] (03Merged) 10jenkins-bot: redfish: add a generation property [software/spicerack] - 10https://gerrit.wikimedia.org/r/807969 (owner: 10Jbond) [09:45:13] (03Merged) 
10jenkins-bot: redfish: Add property for the HttpPushURI [software/spicerack] - 10https://gerrit.wikimedia.org/r/807970 (owner: 10Jbond) [09:46:26] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/807969 (owner: 10Jbond) [09:46:46] (03PS2) 10David Caro: tests: Add nice message to runbook check test failure [alerts] - 10https://gerrit.wikimedia.org/r/815238 [09:46:57] (03CR) 10Jbond: [C: 03+2] redfish: add wait for reboot function [software/spicerack] - 10https://gerrit.wikimedia.org/r/807971 (owner: 10Jbond) [09:47:14] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/807971 (owner: 10Jbond) [09:47:16] (03PS3) 10Jbond: remote: add an __iter__ to RemoteHosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/803243 [09:48:57] (03CR) 10Ayounsi: [C: 03+1] config: fix type hints for YAML callables [software/homer] - 10https://gerrit.wikimedia.org/r/814839 (owner: 10Ayounsi) [09:49:57] (03CR) 10JMeybohm: [C: 03+2] k8s/reboot-nodes: Error if nodes are cordoned (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/812325 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [09:52:32] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2020.codfw.wmnet with reason: Remove node for eventual reimage, T311686 [09:52:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2020.codfw.wmnet with reason: Remove node for eventual reimage, T311686 [09:52:39] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [09:53:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T312990)', diff saved to https://phabricator.wikimedia.org/P31501 and previous config saved to /var/cache/conftool/dbconfig/20220720-095310-marostegui.json [09:53:14] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - 
https://phabricator.wikimedia.org/T312990 [09:53:24] (03PS3) 10Elukey: kserve: upgrade to upstream release 0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/815691 (https://phabricator.wikimedia.org/T311982) [09:54:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2029.codfw.wmnet to cluster codfw and group A [09:55:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2029.codfw.wmnet to cluster codfw and group A [09:55:47] (03CR) 10Volans: remote: add an __iter__ to RemoteHosts (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/803243 (owner: 10Jbond) [09:56:15] (03Merged) 10jenkins-bot: redfish: add wait for reboot function [software/spicerack] - 10https://gerrit.wikimedia.org/r/807971 (owner: 10Jbond) [09:56:17] (03Merged) 10jenkins-bot: k8s/reboot-nodes: Error if nodes are cordoned [cookbooks] - 10https://gerrit.wikimedia.org/r/812325 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [09:58:00] (03CR) 10Volans: [C: 03+2] config: fix type hints for YAML callables [software/homer] - 10https://gerrit.wikimedia.org/r/814839 (owner: 10Ayounsi) [10:01:35] (03CR) 10Jbond: "lgtm optional nit" [puppet] - 10https://gerrit.wikimedia.org/r/815685 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [10:02:37] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/815307 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [10:02:53] (03Merged) 10jenkins-bot: config: fix type hints for YAML callables [software/homer] - 10https://gerrit.wikimedia.org/r/814839 (owner: 10Ayounsi) [10:03:23] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/812846 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [10:04:22] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer] - 10https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710) (owner: 
10Ayounsi) [10:04:34] (03PS3) 10Volans: Add Python 3.10 support [software/homer] - 10https://gerrit.wikimedia.org/r/814827 (owner: 10Ayounsi) [10:06:33] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/811730 (owner: 10Ayounsi) [10:07:07] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff) [10:08:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P31502 and previous config saved to /var/cache/conftool/dbconfig/20220720-100815-marostegui.json [10:08:33] (03PS4) 10Elukey: kserve: upgrade to upstream release 0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/815691 (https://phabricator.wikimedia.org/T311982) [10:08:47] (03CR) 10Volans: [C: 03+2] Add Python 3.10 support [software/homer] - 10https://gerrit.wikimedia.org/r/814827 (owner: 10Ayounsi) [10:09:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2020.codfw.wmnet with OS bullseye [10:09:16] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2020.codfw.wmnet with OS bullseye [10:09:30] (03CR) 10Ayounsi: [C: 03+2] Netbox _get_circuits: add patch panel support [software/homer] - 10https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710) (owner: 10Ayounsi) [10:12:52] (03Merged) 10jenkins-bot: Add Python 3.10 support [software/homer] - 10https://gerrit.wikimedia.org/r/814827 (owner: 10Ayounsi) [10:13:13] (03CR) 10Klausman: [C: 03+1] kserve: upgrade to upstream release 0.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/815691 (https://phabricator.wikimedia.org/T311982) (owner: 10Elukey) [10:13:29] (03PS8) 10Volans: Netbox _get_circuits: add patch panel support 
[software/homer] - 10https://gerrit.wikimedia.org/r/813604 (https://phabricator.wikimedia.org/T304710) (owner: 10Ayounsi) [10:13:33] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on ml-etcd2003.codfw.wmnet with reason: Switch instance to DRBD, T311686 [10:13:36] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [10:13:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ml-etcd2003.codfw.wmnet with reason: Switch instance to DRBD, T311686 [10:21:03] (03PS1) 10Ayounsi: CHANGELOG: add changelogs for release v0.5.1 [software/homer] - 10https://gerrit.wikimedia.org/r/815694 [10:23:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P31503 and previous config saved to /var/cache/conftool/dbconfig/20220720-102320-marostegui.json [10:24:49] (03PS3) 10Filippo Giunchedi: prometheus: adjust blackbox check params/types [puppet] - 10https://gerrit.wikimedia.org/r/815685 (https://phabricator.wikimedia.org/T305847) [10:24:51] (03PS3) 10Filippo Giunchedi: syslog: probe TLS endpoint with blackbox [puppet] - 10https://gerrit.wikimedia.org/r/815307 (https://phabricator.wikimedia.org/T305847) [10:24:53] (03PS7) 10Filippo Giunchedi: phabricator: switch to prometheus-only network probes/checks [puppet] - 10https://gerrit.wikimedia.org/r/812846 (https://phabricator.wikimedia.org/T305847) [10:24:55] (03PS15) 10Filippo Giunchedi: WIP irc check via blackbox [puppet] - 10https://gerrit.wikimedia.org/r/805815 [10:24:59] (03CR) 10Filippo Giunchedi: prometheus: adjust blackbox check params/types (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815685 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [10:25:01] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2020.codfw.wmnet with reason: host reimage [10:26:22] (03CR) 10Volans: "LGTM, minor nits 
inline" [software/homer] - 10https://gerrit.wikimedia.org/r/815694 (owner: 10Ayounsi) [10:27:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2020.codfw.wmnet with reason: host reimage [10:30:13] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd2003.codfw.wmnet with reason: Switch instance to DRBD, T311686 [10:30:17] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [10:30:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd2003.codfw.wmnet with reason: Switch instance to DRBD, T311686 [10:31:23] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [10:31:27] (03CR) 10Jbond: [C: 04-1] "thanks for the comment see in line, i have also -1 this as i remembered it is still a bit incomplete as it fails to mock ocsp response fil" [puppet] - 10https://gerrit.wikimedia.org/r/814866 (owner: 10Jbond) [10:31:50] (03PS2) 10Ayounsi: CHANGELOG: add changelogs for release v0.5.1 [software/homer] - 10https://gerrit.wikimedia.org/r/815694 [10:32:13] (03CR) 10Ayounsi: "Thanks!" 
[software/homer] - 10https://gerrit.wikimedia.org/r/815694 (owner: 10Ayounsi) [10:34:03] (03PS3) 10Ayounsi: CHANGELOG: add changelogs for release v0.5.1 [software/homer] - 10https://gerrit.wikimedia.org/r/815694 [10:35:39] (03CR) 10Filippo Giunchedi: prometheus: enable x509 CN validation in blackbox (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815304 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [10:37:20] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Orchestrator, and 2 others: Puppet host certs do not contain Subject Alt Name entries - https://phabricator.wikimedia.org/T273637 (10jbond) [10:38:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T312990)', diff saved to https://phabricator.wikimedia.org/P31504 and previous config saved to /var/cache/conftool/dbconfig/20220720-103825-marostegui.json [10:38:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [10:38:31] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [10:38:39] (03CR) 10Ayounsi: [C: 03+2] CHANGELOG: add changelogs for release v0.5.1 [software/homer] - 10https://gerrit.wikimedia.org/r/815694 (owner: 10Ayounsi) [10:38:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [10:39:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1127.eqiad.wmnet with reason: Maintenance [10:39:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1127.eqiad.wmnet with reason: Maintenance [10:39:37] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/815685 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [10:39:55] !log marostegui@cumin1001 START - Cookbook 
sre.hosts.downtime for 5:00:00 on db2121.codfw.wmnet with reason: Maintenance [10:40:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2121.codfw.wmnet with reason: Maintenance [10:40:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on 12 hosts with reason: Maintenance [10:40:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on 12 hosts with reason: Maintenance [10:43:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2020.codfw.wmnet with OS bullseye [10:43:25] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2020.codfw.wmnet with OS bullseye completed: - ganeti2020 (**PASS**) - Downtimed on... [10:44:14] (03CR) 10Jbond: [C: 03+1] CHANGELOG: add changelogs for release v0.5.1 [software/homer] - 10https://gerrit.wikimedia.org/r/815694 (owner: 10Ayounsi) [10:45:18] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.5.1 [software/homer] - 10https://gerrit.wikimedia.org/r/815694 (owner: 10Ayounsi) [10:49:11] (03PS1) 10Volans: CHANGELOG: add changelogs for release v3.1.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/815696 [10:50:15] (03PS41) 10Jbond: sre.hardware.firmware-upgrade: create new cookbook for updating idrac and bios [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 [10:51:22] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10MoritzMuehlenhoff) [10:51:56] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v3.1.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/815696 (owner: 10Volans) [10:56:06] (03PS1) 10Ayounsi: Release v0.5.1 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/815698 
[10:57:23] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 4 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [10:57:25] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36317/console" [puppet] - 10https://gerrit.wikimedia.org/r/768766 (owner: 10Jbond) [10:59:09] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/815698 (owner: 10Ayounsi) [10:59:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet [11:01:12] (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] P:varnish::common: Add support for passing wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/768766 (owner: 10Jbond) [11:02:38] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v3.1.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/815696 (owner: 10Volans) [11:03:00] !log draining ganeti2014 T310483 [11:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:56] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Release v0.5.1 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/815698 (owner: 10Ayounsi) [11:05:22] !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.5.1 - ayounsi@cumin1001 [11:06:07] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:07:01] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.5.1 - ayounsi@cumin1001 [11:09:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for 
host ganeti2020.codfw.wmnet [11:16:16] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m3-master [dns] - 10https://gerrit.wikimedia.org/r/815680 (https://phabricator.wikimedia.org/T313382) (owner: 10Marostegui) [11:17:22] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2009.codfw.wmnet with reason: Remove node for eventual reimage, T311686 [11:17:27] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [11:17:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2009.codfw.wmnet with reason: Remove node for eventual reimage, T311686 [11:17:53] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10Marostegui) m3-master dbproxy has been failed over. [11:25:09] PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:33:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1167.eqiad.wmnet with reason: Maintenance [11:34:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1167.eqiad.wmnet with reason: Maintenance [11:34:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:34:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:34:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T312990)', diff saved to https://phabricator.wikimedia.org/P31506 and previous config saved to /var/cache/conftool/dbconfig/20220720-113424-marostegui.json [11:34:28] T312990: Add 
columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [11:34:47] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10dcaro) [11:35:07] RECOVERY - Check systemd state on parse2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:41:00] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Ladsgroup) I tried flamegraph on the host again today. With 500* concurrent threads where it basically never stops accepting conne... [11:45:07] (03CR) 10LSobanski: "Side question, what's the definition of "stable" that would prompt the move to -operations?" [puppet] - 10https://gerrit.wikimedia.org/r/814926 (owner: 10Dzahn) [11:50:27] 10SRE, 10SRE-OnFire, 10Patch-For-Review: productionize 'sremap' and 'filter_victorops_calendar' under sretools.wikimedia.org - https://phabricator.wikimedia.org/T313355 (10LSobanski) Considering the plan to migrate away from miscweb, are there any reasons not to deploy this to K8s from the get go? 
[11:52:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T312990)', diff saved to https://phabricator.wikimedia.org/P31507 and previous config saved to /var/cache/conftool/dbconfig/20220720-115233-marostegui.json [11:52:38] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [11:54:09] RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:06:58] (03CR) 10Jelto: [C: 03+2] gitlab_runner: Allow DNS requests from GitLab runner containers in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/812264 (https://phabricator.wikimedia.org/T311241) (owner: 10Jelto) [12:07:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P31509 and previous config saved to /var/cache/conftool/dbconfig/20220720-120738-marostegui.json [12:13:15] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: adjust blackbox check params/types [puppet] - 10https://gerrit.wikimedia.org/r/815685 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [12:13:35] (03CR) 10Filippo Giunchedi: [C: 03+2] syslog: probe TLS endpoint with blackbox [puppet] - 10https://gerrit.wikimedia.org/r/815307 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [12:14:17] (03PS1) 10David Caro: rabbit: introduce the heartbeat_timeout param and double [puppet] - 10https://gerrit.wikimedia.org/r/815705 (https://phabricator.wikimedia.org/T313400) [12:15:25] (03CR) 10CI reject: [V: 04-1] rabbit: introduce the heartbeat_timeout param and double [puppet] - 10https://gerrit.wikimedia.org/r/815705 (https://phabricator.wikimedia.org/T313400) (owner: 10David Caro) [12:17:37] (03PS42) 10Jbond: sre.hardware.firmware-upgrade: create new cookbook for updating idrac and bios [cookbooks] - 
10https://gerrit.wikimedia.org/r/763215 [12:19:57] RECOVERY - Host analytics1068 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [12:22:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P31510 and previous config saved to /var/cache/conftool/dbconfig/20220720-122246-marostegui.json [12:25:40] (03PS2) 10David Caro: rabbit: introduce the heartbeat_timeout param and double [puppet] - 10https://gerrit.wikimedia.org/r/815705 (https://phabricator.wikimedia.org/T313400) [12:26:36] (03CR) 10CI reject: [V: 04-1] rabbit: introduce the heartbeat_timeout param and double [puppet] - 10https://gerrit.wikimedia.org/r/815705 (https://phabricator.wikimedia.org/T313400) (owner: 10David Caro) [12:26:45] (03CR) 10Kosta Harlan: [C: 03+1] GrowthExperiments: Remove wgGEHomepageTutorialTitle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811664 (owner: 10Urbanecm) [12:27:46] (03CR) 10Kosta Harlan: GrowthExperiments: end mailing list campaign in beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811756 (https://phabricator.wikimedia.org/T307985) (owner: 10Sergio Gimeno) [12:29:07] !log Move pc1014 from pc2 to pc3 T313401 [12:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:12] T313401: Move pc1014 from pc2 to pc3 - https://phabricator.wikimedia.org/T313401 [12:29:24] (03PS3) 10David Caro: rabbit: introduce the heartbeat_timeout param and double [puppet] - 10https://gerrit.wikimedia.org/r/815705 (https://phabricator.wikimedia.org/T313400) [12:30:23] (03PS1) 10Marostegui: pc1014: Move it from pc2 to pc3 [puppet] - 10https://gerrit.wikimedia.org/r/815706 (https://phabricator.wikimedia.org/T313401) [12:30:48] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36320/console" [puppet] - 10https://gerrit.wikimedia.org/r/815705 
(https://phabricator.wikimedia.org/T313400) (owner: 10David Caro) [12:32:58] (03CR) 10Marostegui: [C: 03+2] pc1014: Move it from pc2 to pc3 [puppet] - 10https://gerrit.wikimedia.org/r/815706 (https://phabricator.wikimedia.org/T313401) (owner: 10Marostegui) [12:33:26] (03PS16) 10Filippo Giunchedi: mw_rc_irc: check ircd availability with blackbox prober [puppet] - 10https://gerrit.wikimedia.org/r/805815 (https://phabricator.wikimedia.org/T305847) [12:36:46] (03PS43) 10Jbond: sre.hardware.firmware-upgrade: create new cookbook for updating idrac and bios [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 [12:37:09] (03CR) 10David Caro: [V: 03+1 C: 03+2] "the PCC looks good, will deploy one-by-one" [puppet] - 10https://gerrit.wikimedia.org/r/815705 (https://phabricator.wikimedia.org/T313400) (owner: 10David Caro) [12:37:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T312990)', diff saved to https://phabricator.wikimedia.org/P31511 and previous config saved to /var/cache/conftool/dbconfig/20220720-123751-marostegui.json [12:37:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [12:37:58] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [12:38:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [12:39:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1171.eqiad.wmnet with reason: Maintenance [12:39:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1171.eqiad.wmnet with reason: Maintenance [12:40:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1101.eqiad.wmnet with reason: Maintenance [12:40:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 5:00:00 on db1101.eqiad.wmnet with reason: Maintenance [12:40:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3318 (T312990)', diff saved to https://phabricator.wikimedia.org/P31512 and previous config saved to /var/cache/conftool/dbconfig/20220720-124042-marostegui.json [12:41:34] (03PS4) 10David Caro: rabbit: introduce the heartbeat_timeout param and double [puppet] - 10https://gerrit.wikimedia.org/r/815705 (https://phabricator.wikimedia.org/T313400) [12:43:26] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36321/console" [puppet] - 10https://gerrit.wikimedia.org/r/815705 (https://phabricator.wikimedia.org/T313400) (owner: 10David Caro) [12:43:38] (03PS44) 10Jbond: sre.hardware.firmware-upgrade: create new cookbook for updating idrac and bios [cookbooks] - 10https://gerrit.wikimedia.org/r/763215 [12:44:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T312990)', diff saved to https://phabricator.wikimedia.org/P31513 and previous config saved to /var/cache/conftool/dbconfig/20220720-124453-marostegui.json [12:45:01] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [12:46:38] (03CR) 10David Caro: [V: 03+1 C: 03+2] rabbit: introduce the heartbeat_timeout param and double [puppet] - 10https://gerrit.wikimedia.org/r/815705 (https://phabricator.wikimedia.org/T313400) (owner: 10David Caro) [12:49:12] (03PS1) 10Marostegui: wmnet: Failover s7 master [dns] - 10https://gerrit.wikimedia.org/r/815709 (https://phabricator.wikimedia.org/T313383) [12:49:38] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [dns] - 10https://gerrit.wikimedia.org/r/815709 (https://phabricator.wikimedia.org/T313383) (owner: 10Marostegui) [12:50:19] 10SRE, 10ops-codfw, 10DC-Ops: hw troubleshooting: IPMI failures for elastic20[67, 68, 70, 71, 72] - 
https://phabricator.wikimedia.org/T313369 (10bking) a:05Papaul→03bking [12:50:39] 10SRE, 10ops-codfw, 10DC-Ops: hw troubleshooting: IPMI failures for elastic20[67, 68, 70, 71, 72] - https://phabricator.wikimedia.org/T313369 (10bking) I think there's a cookbook that can fix this, will grab the ticket and give it a shot. [12:51:02] (03PS1) 10Marostegui: mariadb: Promote db1136 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/815710 (https://phabricator.wikimedia.org/T313383) [12:52:43] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/815710 (https://phabricator.wikimedia.org/T313383) (owner: 10Marostegui) [12:58:24] (03CR) 10Ladsgroup: [C: 03+1] wmnet: Failover s7 master [dns] - 10https://gerrit.wikimedia.org/r/815709 (https://phabricator.wikimedia.org/T313383) (owner: 10Marostegui) [13:00:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P31514 and previous config saved to /var/cache/conftool/dbconfig/20220720-130000-marostegui.json [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220720T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. [13:00:21] 10SRE, 10Data-Engineering, 10Discovery: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (10dcausse) We might perhaps be able to drop all wdqs artifacts prior to 0.3.40, this is the oldest reference I found here: https://github.com/wmde/wikibase-relea... 
[13:00:21] I think I’d like to backport a fix [13:00:28] but I could also use a break first ^^ [13:00:34] so maybe a bit later in the window [13:00:43] (I’ll add it to the wikitech page if anything happens) [13:03:11] PROBLEM - Check systemd state on restbase1026 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:03:23] PROBLEM - cassandra-c CQL 10.64.48.182:9042 on restbase1026 is CRITICAL: connect to address 10.64.48.182 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [13:03:31] PROBLEM - cassandra-c service on restbase1026 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:04:09] PROBLEM - cassandra-c SSL 10.64.48.182:7001 on restbase1026 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [13:04:13] 10SRE, 10SRE-OnFire, 10Patch-For-Review: productionize 'sremap' and 'filter_victorops_calendar' under sretools.wikimedia.org - https://phabricator.wikimedia.org/T313355 (10CDanis) filter_victorops_calendar requires some persistent storage, ideally a plain filesystem although we could figure out something els... 
[13:06:49] (03PS1) 10Lucas Werkmeister (WMDE): Load Special:NewLexemeAlpha RL modules on mobile [extensions/WikibaseLexeme] (wmf/1.39.0-wmf.21) - 10https://gerrit.wikimedia.org/r/815425 (https://phabricator.wikimedia.org/T313116) [13:07:09] (03PS1) 10Lucas Werkmeister (WMDE): Load Special:NewLexemeAlpha RL modules on mobile [extensions/WikibaseLexeme] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/815726 (https://phabricator.wikimedia.org/T313116) [13:07:38] (03CR) 10Ladsgroup: [C: 04-1] mariadb: Promote db1136 to s7 master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815710 (https://phabricator.wikimedia.org/T313383) (owner: 10Marostegui) [13:08:05] (03CR) 10Marostegui: [C: 04-2] mariadb: Promote db1136 to s7 master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815710 (https://phabricator.wikimedia.org/T313383) (owner: 10Marostegui) [13:08:27] (03PS2) 10Marostegui: mariadb: Promote db1136 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/815710 (https://phabricator.wikimedia.org/T313383) [13:08:38] (03CR) 10Marostegui: mariadb: Promote db1136 to s7 master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815710 (https://phabricator.wikimedia.org/T313383) (owner: 10Marostegui) [13:09:15] (03CR) 10Michael Große: [C: 03+1] Load Special:NewLexemeAlpha RL modules on mobile [extensions/WikibaseLexeme] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/815726 (https://phabricator.wikimedia.org/T313116) (owner: 10Lucas Werkmeister (WMDE)) [13:10:25] RECOVERY - Check systemd state on restbase1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:10:47] RECOVERY - cassandra-c service on restbase1026 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:11:18] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Load Special:NewLexemeAlpha RL modules on mobile [extensions/WikibaseLexeme] (wmf/1.39.0-wmf.21) - 
10https://gerrit.wikimedia.org/r/815425 (https://phabricator.wikimedia.org/T313116) (owner: 10Lucas Werkmeister (WMDE)) [13:11:24] alright, let’s start backporting [13:11:25] RECOVERY - cassandra-c SSL 10.64.48.182:7001 on restbase1026 is OK: SSL OK - Certificate restbase1026-c valid until 2023-04-14 11:21:30 +0000 (expires in 267 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [13:11:58] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "(Sync src/ first, then WikibaseLexeme.resources.php.)" [extensions/WikibaseLexeme] (wmf/1.39.0-wmf.21) - 10https://gerrit.wikimedia.org/r/815425 (https://phabricator.wikimedia.org/T313116) (owner: 10Lucas Werkmeister (WMDE)) [13:12:41] (03CR) 10Ladsgroup: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/815710 (https://phabricator.wikimedia.org/T313383) (owner: 10Marostegui) [13:13:03] RECOVERY - cassandra-c CQL 10.64.48.182:9042 on restbase1026 is OK: TCP OK - 0.000 second response time on 10.64.48.182 port 9042 https://phabricator.wikimedia.org/T93886 [13:15:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318', diff saved to https://phabricator.wikimedia.org/P31515 and previous config saved to /var/cache/conftool/dbconfig/20220720-131505-marostegui.json [13:15:30] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2034.codfw.wmnet with OS bullseye [13:15:41] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2034.codfw.wmnet with OS bullseye [13:17:07] (03PS1) 10Filippo Giunchedi: prometheus: fix blackbox timeout Pattern [puppet] - 10https://gerrit.wikimedia.org/r/815713 (https://phabricator.wikimedia.org/T305847) [13:17:50] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T313337 
(10phaultfinder) [13:18:26] * MichaelG_WMDE is also here and ready to support testing of that backport [13:18:40] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36322/console" [puppet] - 10https://gerrit.wikimedia.org/r/815713 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:19:06] (03CR) 10Filippo Giunchedi: prometheus: fix blackbox timeout Pattern [puppet] - 10https://gerrit.wikimedia.org/r/815713 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:19:10] taavi: ^ [13:20:12] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36323/console" [puppet] - 10https://gerrit.wikimedia.org/r/815713 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:20:22] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: cr2-eqiad:FPC3 partial failure (PIC2/3) - https://phabricator.wikimedia.org/T312745 (10ayounsi) 05Open→03Resolved Nevermind, tracked in T313337 [13:20:53] godog: thanks! 
lgtm [13:21:46] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: fix blackbox timeout Pattern [puppet] - 10https://gerrit.wikimedia.org/r/815713 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [13:21:54] taavi: cheers, thanks for letting me know [13:22:59] (03CR) 10Vgutierrez: [C: 03+2] admin/gerrit: add gerrit shell admins on gerrit2002 [puppet] - 10https://gerrit.wikimedia.org/r/815402 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [13:23:40] (03PS1) 10Giuseppe Lavagetto: Add additional interpreters, drop 3.6 [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/815715 [13:26:54] (03CR) 10CI reject: [V: 04-1] Add additional interpreters, drop 3.6 [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/815715 (owner: 10Giuseppe Lavagetto) [13:27:14] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Load Special:NewLexemeAlpha RL modules on mobile [extensions/WikibaseLexeme] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/815726 (https://phabricator.wikimedia.org/T313116) (owner: 10Lucas Werkmeister (WMDE)) [13:27:28] (03CR) 10Hashar: "I have asked Valentin to deploy this now so I can get access to gerrit2002.wikimedia.org." 
[puppet] - 10https://gerrit.wikimedia.org/r/815402 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [13:27:43] (03Merged) 10jenkins-bot: Load Special:NewLexemeAlpha RL modules on mobile [extensions/WikibaseLexeme] (wmf/1.39.0-wmf.21) - 10https://gerrit.wikimedia.org/r/815425 (https://phabricator.wikimedia.org/T313116) (owner: 10Lucas Werkmeister (WMDE)) [13:28:43] okay, wmf.21 backport should be on mwdebug1001, let’s test it (cc MichaelG_WMDE) [13:28:43] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 231, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:29:04] * MichaelG_WMDE looks [13:29:39] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2034.codfw.wmnet with reason: host reimage [13:30:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3318 (T312990)', diff saved to https://phabricator.wikimedia.org/P31516 and previous config saved to /var/cache/conftool/dbconfig/20220720-133010-marostegui.json [13:30:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1126.eqiad.wmnet with reason: Maintenance [13:30:14] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [13:30:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1126.eqiad.wmnet with reason: Maintenance [13:30:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1126 (T312990)', diff saved to https://phabricator.wikimedia.org/P31517 and previous config saved to /var/cache/conftool/dbconfig/20220720-133030-marostegui.json [13:31:10] works for me! @Lucas_WMDE [13:31:18] same here, thanks for testing! 
[13:33:08] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2034.codfw.wmnet with reason: host reimage [13:33:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T312990)', diff saved to https://phabricator.wikimedia.org/P31518 and previous config saved to /var/cache/conftool/dbconfig/20220720-133336-marostegui.json [13:33:50] !log cr2-eqiad# deactivate interfaces xe-3/3/0 - [13:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:53] !log cr2-eqiad# deactivate interfaces xe-3/3/0 - T313337 [13:33:56] hmm, a scap warning about “restart-php-fpm-all … called with an empty host list” [13:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:58] T313337: Inbound interface errors - https://phabricator.wikimedia.org/T313337 [13:34:39] I guess that’s fine because the second sync in a moment will restart php-fpm again (and it looks like the file itself was synced) [13:35:24] !log installing request-tracker4 security updates [13:35:26] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T313337 (10ayounsi) Looks like two interfaces are/were showing errors: cr2-eqiad:xe-3/0/3 - remote side seeing inbound errors: https://librenms.wikimedia.org/graphs/to=1658306400/id=12731/type=port_errors/from=1658133600/ I re-enabled the...
[13:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:35:55] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.39.0-wmf.21/extensions/WikibaseLexeme/src/MediaWiki/Config/LexemeLanguageCodePropertyIdConfig.php: Backport: [[gerrit:815425|Load Special:NewLexemeAlpha RL modules on mobile (T313116)]] (1/2) (duration: 03m 34s) [13:36:00] T313116: Special:NewLexemeAlpha doesn’t work on mobile - https://phabricator.wikimedia.org/T313116 [13:36:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:36:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:37:08] same message again [13:37:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:37:34] it doesn’t mention which host it’s for (if any), but the scap still takes the usual amount of time after that [13:37:43] so to me it feels like (most?)
hosts are still getting their php-fpm restarted [13:38:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [13:38:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [13:39:19] to me it works on multiple different hosts [13:39:25] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.39.0-wmf.21/extensions/WikibaseLexeme/WikibaseLexeme.resources.php: Backport: [[gerrit:815425|Load Special:NewLexemeAlpha RL modules on mobile (T313116)]] (2/2) (duration: 03m 08s) [13:39:45] https://test.m.wikidata.org/wiki/Special:NewLexemeAlpha works for me without mwdebug, at least [13:39:53] I guess I won’t worry about it then [13:40:16] but I’ll paste the full message for reference: [13:40:20] `Job /usr/bin/sudo -u root -- /usr/local/sbin/restart-php-fpm-all php7.2-fpm 9223372036854775807 called with an empty host list.` [13:40:33] Lucas_WMDE: scap was updated yesterday [13:44:24] (03Merged) 10jenkins-bot: Load Special:NewLexemeAlpha RL modules on mobile [extensions/WikibaseLexeme] (wmf/1.39.0-wmf.19) - 10https://gerrit.wikimedia.org/r/815726 (https://phabricator.wikimedia.org/T313116) (owner: 10Lucas Werkmeister (WMDE)) [13:45:24] !log installing containerd security updates in Kubernetes codfw cluster [13:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:40] jnuche or jeena: any idea why wmf.19/extensions/WikibaseLexeme/ is “in the middle of an am session”? 
[13:47:55] (asking you since your names appear in ls -l .git/modules/extensions/WikibaseLexeme) [13:48:39] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T313337 (10ayounsi) p:05Triage→03High a:03Cmjohnson [13:48:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P31519 and previous config saved to /var/cache/conftool/dbconfig/20220720-134841-marostegui.json [13:48:47] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2034.codfw.wmnet with OS bullseye [13:48:53] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2034.codfw.wmnet with OS bullseye completed: - elastic2034 (**PAS... [13:49:43] 10SRE, 10Traffic, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10HTTPS: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (10BBlack) a:03Jdforrester-WMF Hi - the process for the public certs+DN... [13:52:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:53:02] PROBLEM - Check systemd state on elastic2034 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:53:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:53:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:53:33] 10SRE, 10ops-codfw, 10DC-Ops: hw troubleshooting: IPMI failures for elastic20[67, 68, 70, 71, 72] - https://phabricator.wikimedia.org/T313369 (10Volans) I was able to fix `elastic2067` via local IPMI. 
I've added the following sections to wikitech: * https://wikitech.wikimedia.org/wiki/Management_Interfaces#I... [13:54:10] !log lucaswerkmeister-wmde@deploy1002 /srv/mediawiki-staging (master $ u=) $ git -C php-1.39.0-wmf.19/extensions/WikibaseLexeme am --skip # T308659 backport already applied [13:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:14] T308659: Validate lemma length in Special:NewLexeme(Alpha) and label/description/aliases length in Special:NewProperty (CVE-2022-34750) - https://phabricator.wikimedia.org/T308659 [13:54:49] okay, now the submodule update works [13:54:55] pulled to mwdebug1001 (cc MichaelG_WMDE) [13:55:12] * MichaelG_WMDE looks [13:55:58] seems to load here, at least (I don’t really want to actually create a lexeme) [13:56:09] looks good to me (not creating a test Lexeme here, because this is production^^) [13:56:14] ^^ [13:57:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:57:28] Lucas_WMDE: there was a merge conflict when applying security patches during the train deployment of wmf.19 [13:57:39] I think that's what you were seeing [13:57:46] yes, I think I resolved it now [13:57:54] ok, sorry about that [13:59:47] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.39.0-wmf.19/extensions/WikibaseLexeme/src/MediaWiki/Config/LexemeLanguageCodePropertyIdConfig.php: Backport: [[gerrit:815726|Load Special:NewLexemeAlpha RL modules on mobile (T313116)]] (1/2) (duration: 02m 56s) [13:59:53] T313116: Special:NewLexemeAlpha doesn’t work on mobile - https://phabricator.wikimedia.org/T313116 [14:00:55] (03CR) 10David Caro: "Sorry for the partial review, things are quite busy lately" [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [14:01:12] 10SRE, 10Traffic, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10HTTPS: Get new edge & internal HTTPS certificates expanded to 
add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (10Jdforrester-WMF) >>! In T313227#8091301, @BBlack wrote: > Hi - the pro... [14:01:16] (03CR) 10David Caro: "Sorry for the partial review, things are quite busy." [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [14:02:48] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10cmooney) Agreed this is a good idea. I can see why it may have been "left alone" previously but given we'd had issues best to bite the bullet and do it. The 40G u... [14:02:50] !log disable puppet on A:cp to deploy Gerrit:768766 [14:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:12] 10SRE, 10ops-codfw, 10DC-Ops: hw troubleshooting: IPMI failures for elastic20[67, 68, 70, 71, 72] - https://phabricator.wikimedia.org/T313369 (10Volans) Updated the comment above as I made the command safer directly in the docs :) [14:03:16] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.39.0-wmf.19/extensions/WikibaseLexeme/WikibaseLexeme.resources.php: Backport: [[gerrit:815726|Load Special:NewLexemeAlpha RL modules on mobile (T313116)]] (2/2) (duration: 03m 02s) [14:03:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126', diff saved to https://phabricator.wikimedia.org/P31520 and previous config saved to /var/cache/conftool/dbconfig/20220720-140346-marostegui.json [14:04:05] !log UTC afternoon backport+config window done [14:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:48] (03PS1) 10Volans: Upstream release v3.1.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/815722 [14:06:38] (03CR) 10Volans: [C: 03+2] Upstream release v3.1.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/815722 (owner: 10Volans) 
[14:07:54] (03CR) 10Ori: [C: 03+2] role::beta::docker_services: prune docker images [puppet] - 10https://gerrit.wikimedia.org/r/815335 (https://phabricator.wikimedia.org/T313334) (owner: 10Ori) [14:09:01] (03PS2) 10Marostegui: Put cloudweb100[34] into service [puppet] - 10https://gerrit.wikimedia.org/r/815378 (https://phabricator.wikimedia.org/T305414) (owner: 10Andrew Bogott) [14:11:51] (03PS2) 10Giuseppe Lavagetto: Add additional interpreters, drop 3.6, 3.5 [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/815715 [14:13:25] (03CR) 10CI reject: [V: 04-1] Add additional interpreters, drop 3.6, 3.5 [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/815715 (owner: 10Giuseppe Lavagetto) [14:13:27] (03Merged) 10jenkins-bot: Upstream release v3.1.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/815722 (owner: 10Volans) [14:17:24] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache free space is below 50 MB on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:17:28] RECOVERY - Check systemd state on elastic2034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1126 (T312990)', diff saved to https://phabricator.wikimedia.org/P31521 and previous config saved to /var/cache/conftool/dbconfig/20220720-141851-marostegui.json [14:18:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1177.eqiad.wmnet with reason: Maintenance [14:18:57] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [14:19:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1177.eqiad.wmnet with reason: Maintenance [14:19:11] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:varnish::common: Add 
support for passing wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/768766 (owner: 10Jbond) [14:19:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T312990)', diff saved to https://phabricator.wikimedia.org/P31522 and previous config saved to /var/cache/conftool/dbconfig/20220720-141912-marostegui.json [14:21:21] (03PS1) 10Majavah: P:openstack::trove: use new rabbitmq_hosts hiera var [puppet] - 10https://gerrit.wikimedia.org/r/815723 [14:22:14] (03PS1) 10Volans: transports.junos: fix upstream regression [software/homer] - 10https://gerrit.wikimedia.org/r/815724 [14:22:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T312990)', diff saved to https://phabricator.wikimedia.org/P31523 and previous config saved to /var/cache/conftool/dbconfig/20220720-142214-marostegui.json [14:23:38] (03PS2) 10Majavah: P:openstack::trove: use new rabbitmq_hosts hiera var [puppet] - 10https://gerrit.wikimedia.org/r/815723 [14:24:35] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36326/console" [puppet] - 10https://gerrit.wikimedia.org/r/815723 (owner: 10Majavah) [14:25:14] (03PS1) 10Jbond: Revert "P:varnish::common: Add support for passing wikimedia_domains" [puppet] - 10https://gerrit.wikimedia.org/r/815727 [14:26:08] !log installing containerd security updates in Kubernetes codfw masters [14:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:11] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "P:varnish::common: Add support for passing wikimedia_domains" [puppet] - 10https://gerrit.wikimedia.org/r/815727 (owner: 10Jbond) [14:26:45] (03PS1) 10Jbond: P:varnish::common: Add support for passing wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/815728 [14:26:57] (03CR) 10Ayounsi: [C: 03+1] transports.junos: fix upstream regression [software/homer] - 10https://gerrit.wikimedia.org/r/815724 
(owner: 10Volans) [14:29:46] PROBLEM - Check systemd state on elastic2036 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:34] (03PS3) 10Giuseppe Lavagetto: Add additional interpreters, drop 3.6, 3.5 [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/815715 [14:30:45] (03CR) 10Andrew Bogott: [C: 03+1] "This looks good, and pcc agrees." [puppet] - 10https://gerrit.wikimedia.org/r/815681 (owner: 10Majavah) [14:36:00] !log uploaded spicerack_3.1.0 to apt.wikimedia.org bullseye-wikimedia [14:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:08] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add additional interpreters, drop 3.6, 3.5 [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/815715 (owner: 10Giuseppe Lavagetto) [14:37:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P31524 and previous config saved to /var/cache/conftool/dbconfig/20220720-143719-marostegui.json [14:37:24] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [14:42:22] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [14:44:06] !log installing spicerack 3.1.0 on cumin2002 [14:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:31] (03CR) 10Ahmon Dancy: gerrit: add gerrit2002 to puppetized known_hosts file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815400 
(https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [14:50:38] (03PS3) 10Giuseppe Lavagetto: Avoid additional errors if connection to poolcounter server fails [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/814851 (https://phabricator.wikimedia.org/T310835) (owner: 10Ahmon Dancy) [14:50:40] (03PS2) 10Giuseppe Lavagetto: Handle socket.timeout the same way as TimeoutError [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/814893 (owner: 10Ahmon Dancy) [14:50:42] (03PS1) 10Giuseppe Lavagetto: Raise the default connection timeout to 2 seconds [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/815747 (https://phabricator.wikimedia.org/T310835) [14:50:44] (03PS1) 10Giuseppe Lavagetto: New version 0.0.3 [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/815748 [14:50:46] (03PS1) 10Giuseppe Lavagetto: New package version 0.0.3-1 [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/815749 [14:52:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P31527 and previous config saved to /var/cache/conftool/dbconfig/20220720-145224-marostegui.json [14:52:56] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Avoid additional errors if connection to poolcounter server fails (031 comment) [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/814851 (https://phabricator.wikimedia.org/T310835) (owner: 10Ahmon Dancy) [14:53:25] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/815683 (owner: 10Majavah) [14:53:42] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:54:24] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/815723 (owner: 10Majavah) [14:56:38] (03CR) 10Ahmon Dancy: [C: 03+1] Raise the 
default connection timeout to 2 seconds [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/815747 (https://phabricator.wikimedia.org/T310835) (owner: 10Giuseppe Lavagetto) [14:59:02] !log jayme@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-codfw [15:03:19] (ProbeDown) firing: (2) Service text-https:443 has failed probes (http_text-https_ip6) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:03:19] (ProbeDown) firing: (4) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:04:21] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2036.codfw.wmnet with OS bullseye [15:04:27] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2036.codfw.wmnet with OS bullseye [15:04:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [15:04:40] three minutes into the shift 😖 [15:04:48] (03CR) 10Hashar: [C: 03+1] acme_chief: add gerrit2002 to hosts allowed to fetch TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/815397 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [15:04:49] xDDD [15:05:11] rzl: 
https://grafana.wikimedia.org/d/m1LYjVjnz/network-icmp-probes?orgId=1&var-site=All&var-target_site=eqsin&var-role=cr&var-family=All something is up with eqsin or transport to there [15:05:20] hmm connectivity issue on eqsin? [15:05:33] I can ping 4 and 6 but curl fails [15:05:41] thanks -- laptop's still warming up, then I'll depool [15:05:45] okay just slow [15:05:58] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 21.19 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [15:06:01] there was also a big spike of appserver queries, although it didn't affect saturation and they were very fast queries [15:06:30] it responded much faster to a 2nd curl [15:06:38] RhinosF1: I made a phab task for that scap message earlier: https://phabricator.wikimedia.org/T313417 [15:06:45] (JobUnavailable) firing: Reduced availability for job swagger_check_restbase_eqsin in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:07:18] (03PS1) 10Majavah: depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/815750 [15:07:20] incoming traffic drop on lvs5001 and lvs5002... 
https://grafana.wikimedia.org/d/000000343/load-balancers-lvs?orgId=1&viewPanel=46&from=now-3h&to=now [15:07:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T312990)', diff saved to https://phabricator.wikimedia.org/P31528 and previous config saved to /var/cache/conftool/dbconfig/20220720-150730-marostegui.json [15:07:31] (03PS1) 10RLazarus: depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/815751 [15:07:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1116.eqiad.wmnet with reason: Maintenance [15:07:34] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [15:07:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1116.eqiad.wmnet with reason: Maintenance [15:07:50] higher latency through both transports so not related to transport link [15:07:59] (03CR) 10CDanis: [C: 03+1] depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/815751 (owner: 10RLazarus) [15:08:19] (ProbeDown) resolved: (4) Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:08:19] (ProbeDown) resolved: (4) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:08:30] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 78.47 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts 
https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [15:08:37] Lucas_WMDE: ack [15:08:45] (03PS2) 10Jbond: P:varnish::common: Add support for passing wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/815728 [15:08:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1114.eqiad.wmnet with reason: Maintenance [15:09:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1114.eqiad.wmnet with reason: Maintenance [15:09:05] hm, might be recovered [15:09:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1114 (T312990)', diff saved to https://phabricator.wikimedia.org/P31529 and previous config saved to /var/cache/conftool/dbconfig/20220720-150908-marostegui.json [15:09:32] I haven't merged the depool yet, going to hold it but keep it ready [15:09:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [15:09:52] ^ that's just delayed [15:10:07] !log jayme@cumin2002 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:wikikube-staging-worker-codfw [15:10:09] rzl: curl is instant compared to very slow when the page happened now [15:10:27] RhinosF1: is your traffic hitting eqsin or somewhere else? 
[15:10:59] cdanis: i just like being nosey and checked when i saw the page because i'm bored and have nothing better to do [15:11:25] rzl: https://i.imgur.com/dx0Ktg0.png [15:11:45] (JobUnavailable) resolved: Reduced availability for job swagger_check_restbase_eqsin in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:11:46] rzl: I kind of suspect that there was a spike of something served statically by the appservers but not cached on mostly eqsin, enough to saturate both transits? [15:12:06] hm! text but not upload, which makes that a little trickier [15:12:10] but still possible [15:12:11] cdanis: transports you mean? [15:12:14] XioNoX: yes sorry [15:12:16] transports [15:12:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T312990)', diff saved to https://phabricator.wikimedia.org/P31530 and previous config saved to /var/cache/conftool/dbconfig/20220720-151216-marostegui.json [15:12:27] rzl: XioNoX: https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=eqiad&var-cluster=appserver&var-instance=All&var-datasource=thanos&from=1658328835477&to=1658329930726&viewPanel=84 [15:12:37] appserver cluster was txing 1.6 Gbyte/sec [15:13:15] (03PS3) 10Jbond: P:varnish::common: Add support for passing wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/815728 [15:13:17] wild [15:13:38] okay, will dig a little and see if I can figure out what that traffic was, but first coffee [15:13:50] XioNoX: just to verify, are you okay leaving eqsin pooled unless this comes back? 
[15:14:07] rzl: yeah everything is back to normal network wise [15:14:13] rad, thanks [15:14:18] https://librenms.wikimedia.org/device/device=159/tab=port/port=13968/ indeed some spikes on the transport links [15:15:26] the codfw-eqsin link lost its OSPF adjacency [15:16:19] very briefly, traffic routed through ulsfo [15:16:34] Jul 20 15:02:41 cr3-eqsin bfdd[16011]: BFD Session 103.102.166.139 (IFL 90) state Up -> Down LD/RD(171/2015) Up time:1w0d 19:19 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry. [15:16:36] purged shows a 5 minutes lag around 15:00 - 15:05 [15:16:39] (on eqsin) [15:16:49] XioNoX: fallout of saturating that link? [15:17:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fix db2167:3318', diff saved to https://phabricator.wikimedia.org/P31531 and previous config saved to /var/cache/conftool/dbconfig/20220720-151711-marostegui.json [15:17:16] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 1 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [15:17:22] cdanis: it's not common but could be yeah [15:20:17] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2036.codfw.wmnet with reason: host reimage [15:23:12] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2036.codfw.wmnet with reason: host reimage [15:26:01] PROBLEM - Disk space on dumpsdata1003 is CRITICAL: DISK CRITICAL - free space: /data 859341 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops [15:26:08] !log jayme@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-codfw [15:26:09] !log jayme@cumin2002 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling 
reboot on A:wikikube-staging-worker-codfw [15:26:40] (03PS1) 10Giuseppe Lavagetto: php_exporter: only export the proper php version [puppet] - 10https://gerrit.wikimedia.org/r/815755 [15:27:05] RECOVERY - Check systemd state on elastic2036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:27:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P31532 and previous config saved to /var/cache/conftool/dbconfig/20220720-152721-marostegui.json [15:28:16] !log jayme@cumin2002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-codfw [15:31:17] jouncebot nowandnext [15:31:17] No deployments scheduled for the next 2 hour(s) and 28 minute(s) [15:31:17] In 2 hour(s) and 28 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220720T1800) [15:31:18] In 2 hour(s) and 28 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220720T1800) [15:31:58] I'm going to run a couple of `scap sync-wikiversions` tests [15:32:54] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T313337 (10Cmjohnson) I swapped the optics for both and cleaned fiber. [15:35:50] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: (no justification provided) [15:38:17] 10SRE, 10Wikimedia-Mailing-lists: Volunteer account erroneously linked with official email id - https://phabricator.wikimedia.org/T313321 (10Ladsgroup) If I can understand the "somehow" better, I might be able to help. Did you try changing the email in https://lists.wikimedia.org/accounts/email/? 
[15:39:06] 10SRE, 10ops-eqiad: Eqiad: patch panel and coupler installation in A1 and A8 - https://phabricator.wikimedia.org/T312895 (10Jclark-ctr) [15:39:19] 10SRE, 10ops-eqiad: Eqiad: patch panel and coupler installation in A1 and A8 - https://phabricator.wikimedia.org/T312895 (10Jclark-ctr) [15:39:37] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: testing [15:39:43] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:39:47] Done testing [15:39:59] 10SRE, 10ops-eqiad: Eqiad: patch panel and coupler installation in A1 and A8 - https://phabricator.wikimedia.org/T312895 (10Jclark-ctr) Racked and installed adapters. Adjusted racking location to U46 [15:40:06] 10SRE, 10ops-eqiad: Eqiad: patch panel and coupler installation in A1 and A8 - https://phabricator.wikimedia.org/T312895 (10Jclark-ctr) 05Open→03Resolved [15:41:29] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [15:42:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114', diff saved to https://phabricator.wikimedia.org/P31534 and previous config saved to /var/cache/conftool/dbconfig/20220720-154227-marostegui.json [15:43:39] (03CR) 10Hashar: [C: 03+1] "I thinks this change is good as is. 
Ahmon has a point we should aim at using the global generated known_hosts file under /etc , but I beli" [puppet] - 10https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [15:45:37] (03CR) 10Ahmon Dancy: gerrit: add gerrit2002 to puppetized known_hosts file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [15:45:51] (03CR) 10Ahmon Dancy: [C: 03+1] gerrit: add gerrit2002 to puppetized known_hosts file [puppet] - 10https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [15:46:08] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2036.codfw.wmnet with OS bullseye [15:46:16] (03PS1) 10JMeybohm: k8s: Adapt retry parameters to reality [software/spicerack] - 10https://gerrit.wikimedia.org/r/815757 (https://phabricator.wikimedia.org/T260661) [15:46:17] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2036.codfw.wmnet with OS bullseye completed: - elastic2034 (**PAS... [15:46:20] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2036.codfw.wmnet with OS bullseye executed with errors: - elastic... [15:48:01] (03PS1) 10Giuseppe Lavagetto: wwwportals: clean up query string on www.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/815759 [15:50:02] 10SRE, 10Wikimedia-Mailing-lists: Volunteer account erroneously linked with official email id - https://phabricator.wikimedia.org/T313321 (10RASharma_WMF) Hi, Id would mean the WMF email address. 
Currently when I try to log in, using my WMF email address, I am asked to verify (which I purposefully haven't)... [15:50:20] !log jayme@cumin2002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-staging-worker-codfw [15:50:29] \o/ [15:51:39] (03PS1) 10Dduvall: Revert "Revert "gitlab_runner: Handle changes to runner config"" [puppet] - 10https://gerrit.wikimedia.org/r/815729 [15:51:51] (03CR) 10Volans: [C: 03+1] "LGTM for the python side" [software/spicerack] - 10https://gerrit.wikimedia.org/r/815757 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [15:52:21] (03CR) 10Hashar: [C: 03+1] "+1 I would love gerrit_servers to be renamed ssh_allowed_hosts in a future change (see inline comment for rationale)." [puppet] - 10https://gerrit.wikimedia.org/r/815398 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [15:52:27] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:53:33] (03PS2) 10Dduvall: Revert "Revert "gitlab_runner: Handle changes to runner config"" [puppet] - 10https://gerrit.wikimedia.org/r/815729 [15:53:51] (03PS1) 10Jbond: C:varnish: improve error messaging for reload-vcl [puppet] - 10https://gerrit.wikimedia.org/r/815761 [15:53:56] (03CR) 10CI reject: [V: 04-1] k8s: Adapt retry parameters to reality [software/spicerack] - 10https://gerrit.wikimedia.org/r/815757 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [15:54:45] (03PS2) 10Giuseppe Lavagetto: wwwportals: clean up query string on www.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/815759 [15:54:57] (03PS3) 10Dduvall: Revert "Revert "gitlab_runner: Handle changes to runner config"" [puppet] - 10https://gerrit.wikimedia.org/r/815729 [15:55:41] PROBLEM - Check systemd state on elastic2036 is CRITICAL: CRITICAL - degraded: The following units failed: 
elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:47] !log dancy@deploy1002 Installing scap version "4.11.2" for 557 hosts [15:57:07] !log dancy@deploy1002 Installation of scap version "4.11.2" completed for 557 hosts [15:57:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1114 (T312990)', diff saved to https://phabricator.wikimedia.org/P31535 and previous config saved to /var/cache/conftool/dbconfig/20220720-155732-marostegui.json [15:57:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1099.eqiad.wmnet with reason: Maintenance [15:57:37] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [15:57:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1099.eqiad.wmnet with reason: Maintenance [15:57:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3318 (T312990)', diff saved to https://phabricator.wikimedia.org/P31536 and previous config saved to /var/cache/conftool/dbconfig/20220720-155752-marostegui.json [16:01:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T312990)', diff saved to https://phabricator.wikimedia.org/P31537 and previous config saved to /var/cache/conftool/dbconfig/20220720-160103-marostegui.json [16:01:37] (03Abandoned) 10Dduvall: Revert "Revert "gitlab_runner: Handle changes to runner config"" [puppet] - 10https://gerrit.wikimedia.org/r/815729 (owner: 10Dduvall) [16:01:56] (03PS2) 10JMeybohm: k8s: Adapt retry parameters to reality [software/spicerack] - 10https://gerrit.wikimedia.org/r/815757 (https://phabricator.wikimedia.org/T260661) [16:03:53] (03CR) 10JMeybohm: k8s: Adapt retry parameters to reality (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/815757 (https://phabricator.wikimedia.org/T260661) (owner: 
10JMeybohm) [16:03:59] (03CR) 10Hashar: [C: 03+1] gerrit: add hiera data for a second replica (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [16:05:45] !log bking@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135 [16:05:51] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [16:08:29] RECOVERY - Check systemd state on elastic2036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P31538 and previous config saved to /var/cache/conftool/dbconfig/20220720-161608-marostegui.json [16:17:26] (03Abandoned) 10RLazarus: depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/815751 (owner: 10RLazarus) [16:21:55] (03CR) 10Herron: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/814848 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [16:23:55] PROBLEM - Check systemd state on elastic2053 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:31:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318', diff saved to https://phabricator.wikimedia.org/P31539 and previous config saved to /var/cache/conftool/dbconfig/20220720-163113-marostegui.json [16:35:45] RECOVERY - Check systemd state on elastic2053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:40:42] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135 [16:40:46] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [16:41:27] (03PS2) 10Volans: transports.junos: fix upstream regression [software/homer] - 10https://gerrit.wikimedia.org/r/815724 [16:43:16] (03PS3) 10Volans: transports.junos: fix upstream regression [software/homer] - 10https://gerrit.wikimedia.org/r/815724 [16:44:13] (03CR) 10JHathaway: beaker: add a method to hack fixes specific to beaker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/814866 (owner: 10Jbond) [16:44:46] (03CR) 10Ayounsi: [C: 03+1] transports.junos: fix upstream regression [software/homer] - 10https://gerrit.wikimedia.org/r/815724 (owner: 10Volans) [16:46:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3318 (T312990)', diff saved to https://phabricator.wikimedia.org/P31540 and previous config saved to /var/cache/conftool/dbconfig/20220720-164618-marostegui.json [16:46:20] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1109.eqiad.wmnet with reason: Maintenance [16:46:24] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [16:46:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1109.eqiad.wmnet with reason: Maintenance [16:46:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1109 (T312990)', diff saved to https://phabricator.wikimedia.org/P31541 and previous config saved to /var/cache/conftool/dbconfig/20220720-164638-marostegui.json [16:47:01] (03CR) 10CI reject: [V: 04-1] transports.junos: fix upstream regression [software/homer] - 10https://gerrit.wikimedia.org/r/815724 (owner: 10Volans) [16:47:27] jouncebot: nowandnext [16:47:27] No deployments scheduled for the next 1 hour(s) and 12 minute(s) [16:47:27] In 1 hour(s) and 12 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220720T1800) [16:47:27] In 1 hour(s) and 12 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220720T1800) [16:47:59] borrowing mwdebug1001 to test an apache change, won't be long [16:48:54] (03CR) 10RLazarus: [C: 03+2] wwwportals: clean up query string on www.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/815759 (owner: 10Giuseppe Lavagetto) [16:49:20] !log rzl@cumin2002:~$ sudo cumin A:mw 'disable-puppet 815759' [16:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:27] (03PS4) 10Volans: transports.junos: fix upstream regression [software/homer] - 10https://gerrit.wikimedia.org/r/815724 [16:49:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109 (T312990)', diff saved to https://phabricator.wikimedia.org/P31542 and previous config saved to 
/var/cache/conftool/dbconfig/20220720-164946-marostegui.json [16:53:02] correction, borrowing mwdebug1002 [16:59:41] (03PS1) 10Dduvall: gitlab_runner: Handle changes to runner config [puppet] - 10https://gerrit.wikimedia.org/r/815769 (https://phabricator.wikimedia.org/T311746) [17:01:27] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:01:51] PROBLEM - PHP opcache health on mwdebug2001 is CRITICAL: CRITICAL: opcache free space is below 50 MB on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [17:01:53] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:06] (03PS1) 10RLazarus: wwwportals: Actually clean up query string on www.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/815770 [17:04:23] (httpbb alerts are expected) [17:04:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109', diff saved to https://phabricator.wikimedia.org/P31543 and previous config saved to /var/cache/conftool/dbconfig/20220720-170451-marostegui.json [17:05:49] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2048.codfw.wmnet with OS bullseye [17:05:55] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2048.codfw.wmnet with OS bullseye [17:07:49] (03CR) 10RLazarus: [C: 03+2] wwwportals: Actually clean up query string on www.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/815770 (owner: 10RLazarus) [17:12:11] !log rzl@cumin2002:~$ sudo cumin A:mw 
'enable-puppet 815759' [17:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:23] RECOVERY - SSH on mw1321.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:19:48] done [17:19:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109', diff saved to https://phabricator.wikimedia.org/P31544 and previous config saved to /var/cache/conftool/dbconfig/20220720-171956-marostegui.json [17:25:53] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2048.codfw.wmnet with reason: host reimage [17:28:16] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2048.codfw.wmnet with reason: host reimage [17:35:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109 (T312990)', diff saved to https://phabricator.wikimedia.org/P31545 and previous config saved to /var/cache/conftool/dbconfig/20220720-173502-marostegui.json [17:35:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1172.eqiad.wmnet with reason: Maintenance [17:35:07] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [17:35:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1172.eqiad.wmnet with reason: Maintenance [17:35:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T312990)', diff saved to https://phabricator.wikimedia.org/P31546 and previous config saved to /var/cache/conftool/dbconfig/20220720-173522-marostegui.json [17:35:43] (03CR) 10Andrew Bogott: wmcs: don't page for most checks (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/813267 (owner: 10David Caro) [17:35:59] (03PS1) 10Ssingh: durum: improve check frontend loading message [puppet] - 
10https://gerrit.wikimedia.org/r/815771 [17:38:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T312990)', diff saved to https://phabricator.wikimedia.org/P31547 and previous config saved to /var/cache/conftool/dbconfig/20220720-173823-marostegui.json [17:38:30] !log bking@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic2048.codfw.wmnet with OS bullseye [17:38:36] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2048.codfw.wmnet with OS bullseye executed with errors: - elastic... [17:38:38] !log bking@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135 [17:38:41] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [17:39:39] (03CR) 10Ssingh: [C: 03+2] durum: improve check frontend loading message [puppet] - 10https://gerrit.wikimedia.org/r/815771 (owner: 10Ssingh) [17:45:06] 10SRE, 10SRE-Access-Requests: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (10MRaishWMF) [17:47:44] (03PS1) 10Andrew Bogott: Keystone: fix sync_time for copying fernet keys [puppet] - 10https://gerrit.wikimedia.org/r/815772 [17:48:51] (03CR) 10Jeena Huneidi: [C: 03+1] scap.cfg.erb: Set gerrit_push_user: trainbranchbot [puppet] - 10https://gerrit.wikimedia.org/r/815329 (https://phabricator.wikimedia.org/T310395) (owner: 10Ahmon Dancy) [17:50:38] (03CR) 10Andrew Bogott: [C: 03+2] Keystone: fix sync_time for copying fernet keys [puppet] - 10https://gerrit.wikimedia.org/r/815772 (owner: 10Andrew Bogott) [17:50:53] !log bking@cumin1001 START 
- Cookbook sre.elasticsearch.force-shard-allocation [17:51:00] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [17:52:01] (03PS1) 10David Caro: wmcs: Add pages for cloudvirt nodes going down [alerts] - 10https://gerrit.wikimedia.org/r/815773 [17:52:47] (03PS5) 10Jdlrobson: Deploy the new grid layout to group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814906 (https://phabricator.wikimedia.org/T312241) [17:53:28] (03CR) 10Andrew Bogott: [C: 03+1] wmcs: Add pages for cloudvirt nodes going down [alerts] - 10https://gerrit.wikimedia.org/r/815773 (owner: 10David Caro) [17:53:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P31548 and previous config saved to /var/cache/conftool/dbconfig/20220720-175328-marostegui.json [17:54:51] (03CR) 10Andrew Bogott: [C: 03+1] "Once https://gerrit.wikimedia.org/r/c/operations/alerts/+/815773 is in place, I'm happy with this patch!" [puppet] - 10https://gerrit.wikimedia.org/r/813267 (owner: 10David Caro) [17:56:37] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:58:20] (03CR) 10David Caro: [C: 03+2] wmcs: Add pages for cloudvirt nodes going down [alerts] - 10https://gerrit.wikimedia.org/r/815773 (owner: 10David Caro) [18:00:05] jeena and jnuche: That opportune time is upon us again. Time for a Train log triage with CPT deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220720T1800). [18:00:05] jeena and jnuche: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220720T1800). 
[18:00:48] (03Merged) 10jenkins-bot: wmcs: Add pages for cloudvirt nodes going down [alerts] - 10https://gerrit.wikimedia.org/r/815773 (owner: 10David Caro) [18:01:39] (03PS4) 10Ryan Kemper: Revert "elastic: increase recovery time" [cookbooks] - 10https://gerrit.wikimedia.org/r/784724 (https://phabricator.wikimedia.org/T305994) (owner: 10Bking) [18:01:47] (03CR) 10Andrew Bogott: [C: 03+2] Revert "striker: Open firewall for Docker-managed service" [puppet] - 10https://gerrit.wikimedia.org/r/811274 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [18:03:07] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] Revert "elastic: increase recovery time" [cookbooks] - 10https://gerrit.wikimedia.org/r/784724 (https://phabricator.wikimedia.org/T305994) (owner: 10Bking) [18:04:31] (03CR) 10Andrew Bogott: [C: 03+2] hieradata: cloudweb-dev: route striker to the docker port [puppet] - 10https://gerrit.wikimedia.org/r/811332 (https://phabricator.wikimedia.org/T306469) (owner: 10Majavah) [18:06:44] (03CR) 10Hashar: [C: 03+1] "I completely forgot about that series of patch. Congratulations!" [puppet] - 10https://gerrit.wikimedia.org/r/815329 (https://phabricator.wikimedia.org/T310395) (owner: 10Ahmon Dancy) [18:06:46] (03CR) 10Andrew Bogott: [C: 03+2] "Striker on cloudweb1002 is broken the same way after this patch as before, so... success?" 
[puppet] - 10https://gerrit.wikimedia.org/r/811332 (https://phabricator.wikimedia.org/T306469) (owner: 10Majavah) [18:08:28] (03PS1) 10Jeena Huneidi: group1 wikis to 1.39.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815775 (https://phabricator.wikimedia.org/T308074) [18:08:30] (03CR) 10Jeena Huneidi: [C: 03+2] group1 wikis to 1.39.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815775 (https://phabricator.wikimedia.org/T308074) (owner: 10Jeena Huneidi) [18:08:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P31549 and previous config saved to /var/cache/conftool/dbconfig/20220720-180834-marostegui.json [18:09:24] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815775 (https://phabricator.wikimedia.org/T308074) (owner: 10Jeena Huneidi) [18:12:57] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.21 refs T308074 [18:13:02] T308074: 1.39.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T308074 [18:15:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:15:16] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135 [18:15:20] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [18:16:05] !log jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.21 refs T308074 (duration: 03m 07s) [18:16:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:16:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:17:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/mwdebug: apply [18:17:24] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2045.codfw.wmnet with OS bullseye [18:17:30] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2045.codfw.wmnet with OS bullseye [18:18:25] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:22:04] (03PS1) 10Ryan Kemper: elastic: bring 3 hosts in for extra capacity [puppet] - 10https://gerrit.wikimedia.org/r/815778 (https://phabricator.wikimedia.org/T300943) [18:23:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T312990)', diff saved to https://phabricator.wikimedia.org/P31550 and previous config saved to /var/cache/conftool/dbconfig/20220720-182339-marostegui.json [18:23:47] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [18:23:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2079.codfw.wmnet with reason: Maintenance [18:24:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2079.codfw.wmnet with reason: Maintenance [18:24:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on 15 hosts with reason: Maintenance [18:24:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on 15 hosts with reason: Maintenance [18:24:44] (03PS2) 10Ryan Kemper: elastic: bring 3 hosts in for extra capacity [puppet] - 10https://gerrit.wikimedia.org/r/815778 (https://phabricator.wikimedia.org/T300943) [18:25:32] !log marostegui@cumin1001 START - 
Cookbook sre.hosts.downtime for 5:00:00 on db1111.eqiad.wmnet with reason: Maintenance [18:25:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1111.eqiad.wmnet with reason: Maintenance [18:25:47] (03PS3) 10Ryan Kemper: elastic: bring 3 hosts in for extra capacity [puppet] - 10https://gerrit.wikimedia.org/r/815778 (https://phabricator.wikimedia.org/T300943) [18:26:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1178.eqiad.wmnet with reason: Maintenance [18:26:53] (03CR) 10Bking: [C: 03+2] elastic: bring 3 hosts in for extra capacity [puppet] - 10https://gerrit.wikimedia.org/r/815778 (https://phabricator.wikimedia.org/T300943) (owner: 10Ryan Kemper) [18:27:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1178.eqiad.wmnet with reason: Maintenance [18:27:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T312990)', diff saved to https://phabricator.wikimedia.org/P31551 and previous config saved to /var/cache/conftool/dbconfig/20220720-182710-marostegui.json [18:28:25] (03PS1) 10Majavah: P:toolforge::grid: add bash completion to exec-manage [puppet] - 10https://gerrit.wikimedia.org/r/815780 [18:29:54] (03PS2) 10Cwhite: logstash: enable pipeline-managed index patterns [puppet] - 10https://gerrit.wikimedia.org/r/799001 (https://phabricator.wikimedia.org/T305175) [18:30:24] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:35:03] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2045.codfw.wmnet with OS bullseye [18:35:10] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for 
host elastic2045.codfw.wmnet with OS bullseye executed with errors: - elastic... [18:36:48] (03PS1) 10Ebernhardson: apifeatureusage: Write using the _doc mapping type [puppet] - 10https://gerrit.wikimedia.org/r/815781 [18:36:50] (03PS1) 10Ebernhardson: apifeatureusage: Adjust index template to use _doc mapping type [puppet] - 10https://gerrit.wikimedia.org/r/815782 [18:36:52] (03PS1) 10Ebernhardson: apifeatureusage: Temporarily remove index template during 6->7 transition [puppet] - 10https://gerrit.wikimedia.org/r/815783 [18:36:54] (03PS1) 10Ebernhardson: apifeatureusage: Drop mapping type from template [puppet] - 10https://gerrit.wikimedia.org/r/815784 [18:37:04] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135 [18:37:08] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [18:37:13] (03PS1) 10Ryan Kemper: elastic: add rack info for 3 new hosts [puppet] - 10https://gerrit.wikimedia.org/r/815785 (https://phabricator.wikimedia.org/T300943) [18:38:47] (03CR) 10Bking: [C: 03+2] elastic: add rack info for 3 new hosts [puppet] - 10https://gerrit.wikimedia.org/r/815785 (https://phabricator.wikimedia.org/T300943) (owner: 10Ryan Kemper) [18:49:26] (03Abandoned) 10Ebernhardson: Remove i18n and IS references to ApiFeatureUsage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814873 (https://phabricator.wikimedia.org/T313248) (owner: 10Ebernhardson) [18:52:48] (03PS1) 10Andrew Bogott: OpenStack Nova: Allow duplicate VM names in different projects. 
[puppet] - 10https://gerrit.wikimedia.org/r/815787 (https://phabricator.wikimedia.org/T305831) [18:53:45] 10SRE, 10serviceops, 10Patch-For-Review: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10Jdforrester-WMF) [18:59:28] (03PS2) 10Dduvall: gitlab_runner: Handle changes to runner config [puppet] - 10https://gerrit.wikimedia.org/r/815769 (https://phabricator.wikimedia.org/T311746) [19:00:19] (03CR) 10CI reject: [V: 04-1] gitlab_runner: Handle changes to runner config [puppet] - 10https://gerrit.wikimedia.org/r/815769 (https://phabricator.wikimedia.org/T311746) (owner: 10Dduvall) [19:00:30] PROBLEM - Check systemd state on elastic2062 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,elasticsearch_6@production-search-codfw.service,elasticsearch_6@production-search-psi-codfw.service,prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:03:51] (03CR) 10Dduvall: "Jelto, I've attempted a refactor here to: 1) hopefully simplify the approach; 2) properly re-configure existing runners; the prior patch c" [puppet] - 10https://gerrit.wikimedia.org/r/815769 (https://phabricator.wikimedia.org/T311746) (owner: 10Dduvall) [19:04:41] (03PS3) 10Dduvall: gitlab_runner: Handle changes to runner config [puppet] - 10https://gerrit.wikimedia.org/r/815769 (https://phabricator.wikimedia.org/T311746) [19:07:17] jeena: jnuche: hey, can we rollback group1 due to T313432? 
commons uploading interfaces (Special:Upload and Special:UploadWizard) are completely broken at least for me [19:07:18] T313432: Error: Call to a member function getConfig() on null - https://phabricator.wikimedia.org/T313432 [19:07:47] okay, I'll roll back [19:07:54] thanks [19:08:05] I'm looking if I can find any obvious causes for that [19:08:13] thanks :) [19:09:00] (03PS1) 10Ssingh: durum: CSP default-src should be none [puppet] - 10https://gerrit.wikimedia.org/r/815789 [19:09:43] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36332/console" [puppet] - 10https://gerrit.wikimedia.org/r/815789 (owner: 10Ssingh) [19:12:46] (03CR) 10Ssingh: [V: 03+1 C: 03+2] durum: CSP default-src should be none [puppet] - 10https://gerrit.wikimedia.org/r/815789 (owner: 10Ssingh) [19:13:06] found the cause, let's see if I can fix it easily [19:13:26] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: Revert "group[0|1] wikis to [VERSION]" [19:13:46] RECOVERY - Check systemd state on elastic2062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:14:47] ugh sorry for messing up the message [19:15:29] !log that should be revert group1 wikis to 1.39.0-wmf.19 [19:16:19] you can still edit on the wiki if you want [19:16:47] but not on twitter, mastodon, or sal.toolforge.org :) [19:17:02] 💀 [19:17:08] (potential) fix is up on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/3D/+/815790/ [19:17:15] * bd808 believes he typos more !log messages than not [19:19:01] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Danielgblack) If you've still got the perf base of the flamegraph, is a possible to get ` perf report --no-children --stdio -i inp... 
[19:19:56] (03PS2) 10Ebernhardson: apifeatureusage: Write using the _doc mapping type [puppet] - 10https://gerrit.wikimedia.org/r/815781 (https://phabricator.wikimedia.org/T313434) [19:19:58] (03PS2) 10Ebernhardson: apifeatureusage: Adjust index template to use _doc mapping type [puppet] - 10https://gerrit.wikimedia.org/r/815782 (https://phabricator.wikimedia.org/T313434) [19:20:00] (03PS2) 10Ebernhardson: apifeatureusage: Temporarily remove index template during 6->7 transition [puppet] - 10https://gerrit.wikimedia.org/r/815783 (https://phabricator.wikimedia.org/T313434) [19:20:02] (03PS2) 10Ebernhardson: apifeatureusage: Drop mapping type from template [puppet] - 10https://gerrit.wikimedia.org/r/815784 (https://phabricator.wikimedia.org/T313434) [19:20:19] (03PS1) 10Jeena Huneidi: Revert "group1 wikis to 1.39.0-wmf.21" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815792 (https://phabricator.wikimedia.org/T308074) [19:20:21] (03CR) 10Jeena Huneidi: [C: 03+2] Revert "group1 wikis to 1.39.0-wmf.21" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815792 (https://phabricator.wikimedia.org/T308074) (owner: 10Jeena Huneidi) [19:20:26] (03PS1) 10Ladsgroup: PatentFormField: pass on $this->mParent to HTMLRadioField constructor [extensions/3D] (wmf/1.39.0-wmf.21) - 10https://gerrit.wikimedia.org/r/815733 (https://phabricator.wikimedia.org/T313432) [19:20:32] thanks Amir1 [19:20:39] who wants to deploy a backport? [19:20:46] thank you for the patch [19:20:53] where is jeena [19:21:01] Can I? 
[19:21:15] I just need to merge this config patch [19:22:19] (03PS2) 10Jeena Huneidi: Revert "group1 wikis to 1.39.0-wmf.19" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815792 (https://phabricator.wikimedia.org/T308074) [19:22:34] (03CR) 10Jeena Huneidi: [C: 03+2] Revert "group1 wikis to 1.39.0-wmf.19" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815792 (https://phabricator.wikimedia.org/T308074) (owner: 10Jeena Huneidi) [19:23:01] Amir1: all good [19:23:16] (03CR) 10Ladsgroup: [C: 03+2] PatentFormField: pass on $this->mParent to HTMLRadioField constructor [extensions/3D] (wmf/1.39.0-wmf.21) - 10https://gerrit.wikimedia.org/r/815733 (https://phabricator.wikimedia.org/T313432) (owner: 10Ladsgroup) [19:23:30] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.39.0-wmf.19" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815792 (https://phabricator.wikimedia.org/T308074) (owner: 10Jeena Huneidi) [19:24:54] (03CR) 10JHathaway: "Running:" [puppet] - 10https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635) (owner: 10Jbond) [19:25:18] (03Merged) 10jenkins-bot: PatentFormField: pass on $this->mParent to HTMLRadioField constructor [extensions/3D] (wmf/1.39.0-wmf.21) - 10https://gerrit.wikimedia.org/r/815733 (https://phabricator.wikimedia.org/T313432) (owner: 10Ladsgroup) [19:25:30] that was fast [19:26:40] taavi: now you can't test it I guess? 
[19:26:54] jeena: now that group1 is back wmf.19 we can't see if it's fixed [19:27:05] I can roll forward if you like [19:27:06] I can manually hack commons to .21 on a mwdebug box [19:27:20] taavi: nah, let's push and move forward [19:27:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T312990)', diff saved to https://phabricator.wikimedia.org/P31552 and previous config saved to /var/cache/conftool/dbconfig/20220720-192724-marostegui.json [19:27:30] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [19:27:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:28:01] should I go ahead? [19:28:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:28:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:29:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:30:21] (03CR) 10Jeena Huneidi: [C: 03+2] "I got confused by the title and changed the commit message... So the original message Revert "group1 wikis to 1.39.0-wmf.21" was correct." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/815792 (https://phabricator.wikimedia.org/T308074) (owner: 10Jeena Huneidi) [19:30:59] jeena: the sync will finish in a sec [19:32:24] I have some important stuff (not risky, important) being shipped in wmf.21, I really want to see it done [19:32:36] The last pieces of templatelinks normalization [19:32:45] :) [19:33:57] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.21/extensions/3D/src/PatentFormField.php: Backport: [[gerrit:815733|PatentFormField: pass on $this->mParent to HTMLRadioField constructor (T313432)]] (duration: 03m 08s) [19:34:01] T313432: Error: Call to a member function getConfig() on null - https://phabricator.wikimedia.org/T313432 [19:34:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:35:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:35:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:36:20] jeena: done, feel free to move ahead [19:36:38] thanks Amir1 & taavi [19:36:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:36:58] (03PS1) 10Jeena Huneidi: group1 wikis to 1.39.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815793 (https://phabricator.wikimedia.org/T308074) [19:37:04] (03CR) 10Jeena Huneidi: [C: 03+2] group1 wikis to 1.39.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815793 (https://phabricator.wikimedia.org/T308074) (owner: 10Jeena Huneidi) [19:37:08] 10SRE, 10Znuny, 10serviceops, 10serviceops-collab, 10Sustainability (Incident Followup): enhance Znuny (otrs) alerting - https://phabricator.wikimedia.org/T303190 (10Dzahn) [19:38:33] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815793 (https://phabricator.wikimedia.org/T308074) (owner: 10Jeena Huneidi) [19:40:41] 10SRE, 10MediaWiki-General, 
10Traffic-Icebox, 10Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (10ori) Notes based on IRC discussion on #wikimedia-traffic: * We only want to apply query sorting to text requests for now, because we ca... [19:41:28] fix confirmed working [19:41:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:42:07] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.21 refs T308074 [19:42:10] T308074: 1.39.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T308074 [19:42:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P31553 and previous config saved to /var/cache/conftool/dbconfig/20220720-194229-marostegui.json [19:42:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:42:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:43:28] 10SRE, 10Znuny, 10serviceops, 10serviceops-collab, 10Sustainability (Incident Followup): enhance Znuny (otrs) alerting - https://phabricator.wikimedia.org/T303190 (10Dzahn) also see T312194#8092388 We now have working checks. Here you can see how it is working: https://grafana-rw.wikimedia.org/explore... 
[19:43:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:45:00] !log jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.21 refs T308074 (duration: 02m 53s) [19:45:22] (03CR) 10Dzahn: [C: 03+2] acme_chief: add gerrit2002 to hosts allowed to fetch TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/815397 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [19:46:07] (03CR) 10Andrea Denisse: netmon: Add suppport for multiple backup/passive nodes in Puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/814848 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [19:48:29] (03CR) 10Dzahn: gerrit: add gerrit2002 to firewall rules for cluster support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815398 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [19:48:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:49:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:49:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:50:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:51:33] (03PS1) 10Ladsgroup: wwwportals: Make sure portal assets are also visible in wikiquote vhost [puppet] - 10https://gerrit.wikimedia.org/r/815794 (https://phabricator.wikimedia.org/T273179) [19:53:14] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135 [19:53:20] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [19:54:55] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2032.codfw.wmnet with OS bullseye [19:55:02] 10SRE, 10Discovery-Search (Current work): 
Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2032.codfw.wmnet with OS bullseye [19:57:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P31554 and previous config saved to /var/cache/conftool/dbconfig/20220720-195734-marostegui.json [20:00:05] RoanKattouw, Urbanecm, and cjming: That opportune time is upon us again. Time for a UTC late backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220720T2000). [20:00:05] MatmaRex and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:25] i can deploy o/ (since i'm covering for Jon's patch) [20:00:41] o/ [20:00:44] I'm also here if needed [20:00:55] hi [20:01:08] hi MatmaRex - let's do it [20:01:14] (03PS2) 10Clare Ming: Enable DiscussionTools visualenhancements as beta feature on partner wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815359 (https://phabricator.wikimedia.org/T312670) (owner: 10Bartosz Dziewoński) [20:03:24] thanks urbanecm! hopefully you won't be needed [20:04:05] (03CR) 10Clare Ming: [C: 03+2] Enable DiscussionTools visualenhancements as beta feature on partner wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815359 (https://phabricator.wikimedia.org/T312670) (owner: 10Bartosz Dziewoński) [20:04:49] (03Merged) 10jenkins-bot: Enable DiscussionTools visualenhancements as beta feature on partner wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/815359 (https://phabricator.wikimedia.org/T312670) (owner: 10Bartosz Dziewoński) [20:06:01] MatmaRex: ur patch is up on mwdebug1002 - is it testable? 
[20:06:10] yeah, looking [20:06:40] (03CR) 10Cwhite: [C: 03+1] apifeatureusage: Drop mapping type from template [puppet] - 10https://gerrit.wikimedia.org/r/815784 (https://phabricator.wikimedia.org/T313434) (owner: 10Ebernhardson) [20:06:49] (03PS2) 10Dzahn: gerrit: add gerrit2002 to firewall rules for cluster support [puppet] - 10https://gerrit.wikimedia.org/r/815398 (https://phabricator.wikimedia.org/T313250) [20:07:29] @cjming I'm around after all if you need some help with testing [20:07:53] cool - thanks Jdlrobson [20:07:56] cjming: looks good [20:08:01] fabu - syncing [20:08:14] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2032.codfw.wmnet with reason: host reimage [20:08:52] (03PS6) 10Clare Ming: Deploy the new grid layout to group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814906 (https://phabricator.wikimedia.org/T312241) (owner: 10Jdlrobson) [20:08:55] (03CR) 10Cwhite: [C: 03+1] "Expect logstash will be restarted by puppet when this gets deployed." 
[puppet] - 10https://gerrit.wikimedia.org/r/815783 (https://phabricator.wikimedia.org/T313434) (owner: 10Ebernhardson) [20:10:25] (03CR) 10Jbond: beaker: add initial beaker files (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635) (owner: 10Jbond) [20:10:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:11:21] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:815359|Enable DiscussionTools visualenhancements as beta feature on partner wikis (T312670)]] (duration: 03m 10s) [20:11:25] T312670: [Config Change] Enable Topic Containers as beta feature at partner wikis (desktop) - https://phabricator.wikimedia.org/T312670 [20:11:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:11:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:11:45] MatmaRex: can you verify on prod? there were some issues syncing yesterday [20:11:50] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2032.codfw.wmnet with reason: host reimage [20:12:00] ok [20:12:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:12:26] (03CR) 10JHathaway: [C: 03+1] "Patch looks good, I'm curious why it took so long to run on my box, could be an issue with podman." 
[puppet] - 10https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635) (owner: 10Jbond) [20:12:32] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:12:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T312990)', diff saved to https://phabricator.wikimedia.org/P31555 and previous config saved to /var/cache/conftool/dbconfig/20220720-201240-marostegui.json [20:12:44] T312990: Add columns geo_tags.gt_lat_int/gt_lon_int to unify schema on wmf wikis - https://phabricator.wikimedia.org/T312990 [20:14:05] cjming: is the change deployed now? i don't see the expected effect when not using mwdebug [20:14:30] hmm - should be [20:14:48] if you could confirm [20:14:54] dancy: if you're around, do i still need to double sync? [20:14:54] you need to enable "Discussion tools" at https://ar.wikipedia.org/wiki/خاص:تفضيلات?uselang=en#mw-prefsection-betafeatures [20:15:03] and then visit https://ar.wikipedia.org/wiki/نقاش:الصفحة_الرئيسية [20:15:18] cjming: That problem should be fixed now. [20:15:20] each heading should have some metadata added underneath it, in grey text [20:16:09] i am only seeing the change inconsistently whenever i reload the page [20:16:19] so this seems related to issues we've had, uhh, a couple weeks ago? [20:16:26] That does seem to imply the same type of syncing problem. [20:16:28] where changes didn't take effect on some servers [20:16:59] dancy: should i sync again just to be sure? [20:17:15] lemme see if I can dig up some evidence first. 
[20:17:44] great - thanks [20:20:40] PROBLEM - SSH on mw1321.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:20:44] MatmaRex: i don't really know what i'm looking at - i'm looking for diffs but not sure what is expected on the page you linked [20:21:22] one sec [20:22:22] current (bad): https://phabricator.wikimedia.org/F35326992 expected (good): https://phabricator.wikimedia.org/F35326993 [20:22:26] actually now i think i see it [20:22:33] note the different font and the text line below heading [20:23:07] the same effect should appear on any talk page (this link is the talk of the main page) [20:23:49] Clare please resync [20:23:54] alrighty [20:27:15] I assume there were no interesting messages during the first sync [20:27:30] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:815359|Enable DiscussionTools visualenhancements as beta feature on partner wikis (T312670)]] (duration: 03m 26s) [20:27:34] T312670: [Config Change] Enable Topic Containers as beta feature at partner wikis (desktop) - https://phabricator.wikimedia.org/T312670 [20:28:53] dancy: not that i recall - and just now i resync'd but i'm still not seeing the expected change in one of my tabs [20:29:22] fwiw resync seemed suspiciously fast [20:29:34] How long? [20:29:39] (for the php-fpm-restart phase) [20:30:27] i'm seeing the expected result now, over several refreshes [20:30:27] It should be around 2 minutes [20:30:29] it says 2m 42s [20:30:40] ok that's a normal duration [20:30:43] MatmaRex: great! [20:30:47] maybe we're good [20:30:49] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:31:04] So we still have the same problem.
[20:31:08] I'll reopen the ticket [20:31:09] bummer [20:31:52] ok - I guess i'll move on then [20:31:59] and resync if needed [20:32:29] (03CR) 10Clare Ming: [C: 03+2] Deploy the new grid layout to group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814906 (https://phabricator.wikimedia.org/T312241) (owner: 10Jdlrobson) [20:32:57] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2032.codfw.wmnet with OS bullseye [20:33:02] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2032.codfw.wmnet with OS bullseye completed: - elastic2032 (**WAR... [20:33:15] (03Merged) 10jenkins-bot: Deploy the new grid layout to group 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814906 (https://phabricator.wikimedia.org/T312241) (owner: 10Jdlrobson) [20:33:54] thanks [20:34:31] and thanks for double-checking cjming, i also assumed that this would be fixed [20:34:35] np! [20:34:47] It was supposed to be! [20:36:37] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/36334/gerrit1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/815398 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [20:37:36] firewall change on gerrit... incoming... [20:37:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:38:06] ferm reloaded. gerrit still up. 
go on :) [20:38:35] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:814906|Deploy the new grid layout to group 1 (T312241)]] (duration: 03m 14s) [20:38:38] T312241: Deploy the new grid layout - https://phabricator.wikimedia.org/T312241 [20:38:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:38:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:39:06] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad row C switch fabric recabling - https://phabricator.wikimedia.org/T313384 (10wiki_willy) a:03Jclark-ctr [20:39:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:40:04] cjming: I see grid on debug1001 on group 1 wikis so I think that's good to sync [20:40:17] PROBLEM - Check systemd state on elastic2032 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:41:19] Jdlrobson: sounds good - i'm actually resyncing bec i didn't see grid on itwiki - can you verify on prod? 
[20:41:25] (03CR) 10Dzahn: gerrit: add gerrit2002 to puppetized known_hosts file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [20:41:31] (03PS3) 10Dzahn: gerrit: add gerrit2002 to puppetized known_hosts file [puppet] - 10https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250) [20:41:40] I was looking at Hebrew [20:41:48] I see it on Italian too though [20:42:01] oh good [20:42:36] (03CR) 10Dzahn: gerrit: add gerrit2002 to puppetized known_hosts file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [20:42:59] and i see it now so i think we're good [20:43:00] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:814906|Deploy the new grid layout to group 1 (T312241)]] (duration: 03m 16s) [20:45:19] shutting it down early -- if someone needs something in the next few, just give me a poke [20:45:27] !log end of UTC late backport window [20:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:23] (03CR) 10Cwhite: [C: 03+2] "PCC checks out: https://puppet-compiler.wmflabs.org/pcc-worker1002/36331/" [puppet] - 10https://gerrit.wikimedia.org/r/799001 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [20:53:55] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB on php 7.2.
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [20:56:23] RECOVERY - Check systemd state on elastic2032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:58:07] (03CR) 10Dzahn: "compiling this shows this as noop but that's because only the "homedir" is a puppet resource and it has "recurse => 'remote'"" [puppet] - 10https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [21:13:51] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:14:39] (03CR) 10Dzahn: [C: 03+2] gerrit: add gerrit2002 to puppetized known_hosts file [puppet] - 10https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [21:19:04] cjming: Are you still around? I'd like to look at the transcripts of the deployments you did today to see if I can draw any conclusions. [21:19:37] dancy: yup -- i'll see if i can dig them up [21:19:39] (03CR) 10Dzahn: [C: 04-1] gerrit: add hiera data for a second replica (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) (owner: 10Dzahn) [21:21:32] (03PS1) 10Cwhite: logstash: add missing closing curly brace [puppet] - 10https://gerrit.wikimedia.org/r/815799 (https://phabricator.wikimedia.org/T305175) [21:22:04] (03CR) 10Cwhite: [V: 03+2 C: 03+2] logstash: add missing closing curly brace [puppet] - 10https://gerrit.wikimedia.org/r/815799 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite) [21:24:46] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10ori) It looks like the maximum rate at which swift-object-expirer will issue deletes is configurable via [[ https://github.com/op... 
[21:24:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [21:29:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [21:33:28] (03CR) 10Dzahn: [C: 03+1] "yea, so I don't really know much here (how to test it, when the previous check was added) but let me say I have no concerns if you just do" [puppet] - 10https://gerrit.wikimedia.org/r/805815 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [21:34:48] (03CR) 10Dzahn: [C: 03+1] phabricator: switch to prometheus-only network probes/checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/812846 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [21:37:52] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/805815 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [21:39:44] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10nskaggs) [21:41:44] (03CR) 10Cwhite: [C: 03+1] apifeatureusage: Write using the _doc mapping type [puppet] - 10https://gerrit.wikimedia.org/r/815781 (https://phabricator.wikimedia.org/T313434) (owner: 10Ebernhardson) [21:41:50] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): asw2-c5-eqiad crash - https://phabricator.wikimedia.org/T313382 (10nskaggs) >>! In T313382#8090176, @Marostegui wrote: > - dbproxy1018 and dbproxy1019 are active WMCS proxies, need to be handled by them cc @nskaggs (they should... 
[21:42:23] (03CR) 10Dzahn: prometheus::blackbox::http: add/edit parameter comments (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/807176 (owner: 10Dzahn) [21:43:59] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: Move dbproxy1019 from C5 to B6 - https://phabricator.wikimedia.org/T313445 (10JJMC89) [21:45:21] !log bking@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135 [21:45:27] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [21:46:36] (03CR) 10Dzahn: prometheus::blackbox::http: add/edit parameter comments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/807176 (owner: 10Dzahn) [21:52:55] (03CR) 10Cwhite: "Following up from Keith's comment, one possible solution to the orphaned configs." [puppet] - 10https://gerrit.wikimedia.org/r/814848 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [21:53:34] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135 [21:53:40] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [21:54:54] (03PS2) 10Dzahn: prometheus::blackbox::http: add/edit parameter comments [puppet] - 10https://gerrit.wikimedia.org/r/807176 [21:57:46] (03CR) 10Dzahn: "a bit of rebasing hell due to other changes but fixing it" [puppet] - 10https://gerrit.wikimedia.org/r/807176 (owner: 10Dzahn) [22:12:39] (03PS3) 10Dzahn: prometheus::blackbox::http: add/edit parameter comments [puppet] - 10https://gerrit.wikimedia.org/r/807176 [22:16:08] (03PS4) 10Dzahn: prometheus::blackbox::http: add/edit parameter comments [puppet] - 
10https://gerrit.wikimedia.org/r/807176 [22:18:46] (03CR) 10BryanDavis: [V: 03+1] hieradata: cloudweb-dev: route striker to the docker port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/811332 (https://phabricator.wikimedia.org/T306469) (owner: 10Majavah) [22:23:24] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Ladsgroup) Sorry it took me a bit to get it done: - https://people.wikimedia.org/~ladsgroup/mariadb_flamegraphs/nochildern.superbu... [22:24:33] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:26:21] PROBLEM - Check systemd state on elastic2059 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:26:54] 10SRE, 10Znuny, 10serviceops, 10serviceops-collab, 10Sustainability (Incident Followup): enhance Znuny (otrs) alerting - https://phabricator.wikimedia.org/T303190 (10Dzahn) here is a screenshot that shows how to get this on https://grafana-rw.wikimedia.org {F35327060} [22:36:25] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:43:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:48:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:51:59] RECOVERY - Check systemd state on elastic2050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:03:23] 10SRE, 10ops-codfw, 10DC-Ops: hw troubleshooting: IPMI failures for elastic20[67, 68, 70, 71, 72] - https://phabricator.wikimedia.org/T313369 (10RKemper) 05Open→03Resolved >>! In T313369#8091365, @Volans wrote: > Updated the comment above as I made the command safer directly in the docs :) Thanks! I fol... [23:07:53] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2067.codfw.wmnet with OS bullseye [23:07:53] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2068.codfw.wmnet with OS bullseye [23:10:09] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2070.codfw.wmnet with OS bullseye [23:10:10] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2071.codfw.wmnet with OS bullseye [23:10:11] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2072.codfw.wmnet with OS bullseye [23:11:55] !log T300943 Fixed IPMI passwords for elastic `20[67,68,70,71,72]`, reimaging them to bullseye (these hosts are not in service, thus the batch operation) [23:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:00] T300943: Service implementation for elastic20[61-86].codfw.wmnet - https://phabricator.wikimedia.org/T300943 [23:22:05] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2068.codfw.wmnet with reason: host reimage [23:22:12] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2067.codfw.wmnet with reason: host reimage [23:24:13] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 
elastic2070.codfw.wmnet with reason: host reimage [23:24:21] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2072.codfw.wmnet with reason: host reimage [23:24:29] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2071.codfw.wmnet with reason: host reimage [23:24:49] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2068.codfw.wmnet with reason: host reimage [23:28:21] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2071.codfw.wmnet with reason: host reimage [23:29:44] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on elastic2067.codfw.wmnet with reason: host reimage [23:29:52] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2070.codfw.wmnet with reason: host reimage [23:32:14] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2072.codfw.wmnet with reason: host reimage [23:34:19] (03PS5) 10Jdlrobson: Deploy grid to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814907 (https://phabricator.wikimedia.org/T312241) [23:37:04] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:38:48] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2068.codfw.wmnet with OS bullseye [23:42:33] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2071.codfw.wmnet with OS bullseye [23:43:55] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2070.codfw.wmnet with OS bullseye [23:44:13] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2067.codfw.wmnet with OS 
bullseye [23:46:40] 10SRE-swift-storage: Uncaught TimeoutError from inactivedc_request caused swift-proxy to wedge itself - https://phabricator.wikimedia.org/T313102 (10tstarling) I'm working on T279664. Active/active multi-DC mode for MediaWiki is coming very soon. About a month ago I did a quick review of multi-DC support in the... [23:47:20] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2072.codfw.wmnet with OS bullseye [23:49:58] (03PS8) 10Fomafix: Add language codes sr-cyrl and sr-latn next to sr-ec and sr-el [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375616 (https://phabricator.wikimedia.org/T117845) [23:52:00] RECOVERY - SSH on mw1321.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:55:51] (03PS14) 10Fomafix: Add additional aliases for sr-cyrl and sr-latn next to sr-ec and sr-el [puppet] - 10https://gerrit.wikimedia.org/r/368248 (https://phabricator.wikimedia.org/T117845) [23:57:27] 10SRE-swift-storage: Uncaught TimeoutError from inactivedc_request caused swift-proxy to wedge itself - https://phabricator.wikimedia.org/T313102 (10tstarling) T201858 contains a generous dose of clue. Gilles said "I suspect that making thumbnail traffic active/active might actually require less effort than the...